A Survey on Vision-Language-Action Models: An Action Tokenization Perspective Paper • 2507.01925 • Published Jul 2, 2025 • 39
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Paper • 2507.16746 • Published Jul 22, 2025 • 35
MolmoAct: Action Reasoning Models that can Reason in Space Paper • 2508.07917 • Published Aug 11, 2025 • 45
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies Paper • 2508.20072 • Published Aug 27, 2025 • 32
ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks Paper • 2508.15804 • Published Aug 14, 2025 • 15
F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions Paper • 2509.06951 • Published Sep 8, 2025 • 33
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model Paper • 2509.09372 • Published Sep 11, 2025 • 256
Lost in Embeddings: Information Loss in Vision-Language Models Paper • 2509.11986 • Published Sep 15, 2025 • 29
A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning Paper • 2509.15937 • Published Sep 19, 2025 • 20
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing Paper • 2509.22186 • Published Sep 26, 2025 • 164
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models Paper • 2509.25848 • Published Sep 30, 2025 • 81
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models Paper • 2510.13626 • Published Oct 15, 2025 • 48
ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning Paper • 2510.12693 • Published Oct 14, 2025 • 28
GigaBrain-0: A World Model-Powered Vision-Language-Action Model Paper • 2510.19430 • Published Oct 22, 2025 • 53
π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models Paper • 2510.25889 • Published Oct 29, 2025 • 66
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution Paper • 2511.14210 • Published Nov 18, 2025 • 21
VisPlay: Self-Evolving Vision-Language Models from Images Paper • 2511.15661 • Published Nov 19, 2025 • 45
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models Paper • 2511.15605 • Published Nov 19, 2025 • 25
Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs Paper • 2511.19773 • Published Nov 24, 2025 • 10
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight Paper • 2511.16175 • Published Nov 20, 2025 • 12
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion Paper • 2512.19535 • Published Dec 22, 2025 • 12
QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models Paper • 2512.19526 • Published Dec 22, 2025 • 12
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning Paper • 2601.09708 • Published Jan 14 • 55
Innovator-VL: A Multimodal Large Language Model for Scientific Discovery Paper • 2601.19325 • Published Jan 27 • 82
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation Paper • 2601.22153 • Published Jan 29 • 75
TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics Paper • 2602.19313 • Published Feb 22 • 26
EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots Paper • 2602.18071 • Published Feb 20 • 22
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders Paper • 2603.06569 • Published Mar 6 • 120
π-StepNFT: Wider Space Needs Finer Steps in Online RL for Flow-based VLAs Paper • 2603.02083 • Published Mar 2 • 9
VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning? Paper • 2603.07888 • Published Mar 9 • 10
Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models Paper • 2603.15557 • Published Mar 16 • 29
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models Paper • 2603.18002 • Published Mar 18 • 13
VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining Paper • 2603.15030 • Published Mar 16 • 21
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning Paper • 2603.17024 • Published Mar 17 • 110
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model Paper • 2603.21986 • Published Mar 23 • 125
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs Paper • 2603.16932 • Published Mar 14 • 90
SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning Paper • 2603.22057 • Published Mar 23 • 46
VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors Paper • 2604.02486 • Published Apr 2 • 10
HY-Embodied-0.5: Embodied Foundation Models for Real-World Agents Paper • 2604.07430 • Published Apr 8 • 189
Let ViT Speak: Generative Language-Image Pre-training Paper • 2605.00809 • Published 29 days ago • 33
MolmoAct2: Action Reasoning Models for Real-world Deployment Paper • 2605.02881 • Published 26 days ago • 345
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models Paper • 2605.14906 • Published 16 days ago • 75
StableVLA: Towards Robust Vision-Language-Action Models without Extra Data Paper • 2605.18287 • Published 12 days ago • 15
GEM: Generative Supervision Helps Embodied Intelligence Paper • 2605.28548 • Published 3 days ago • 37
PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft Paper • 2605.27762 • Published 4 days ago
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments Paper • 2605.30280 • Published 2 days ago • 73