Vega: Learning to Drive with Natural Language Instructions
Abstract
Vega is a unified Vision-Language-World-Action model that combines autoregressive and diffusion paradigms for instruction-based driving planning and trajectory generation.
Vision-language-action models have reshaped autonomous driving by incorporating language into the decision-making process. However, most existing pipelines use the language modality only for scene description or reasoning, and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions and the corresponding trajectories. We then propose Vega, a unified Vision-Language-World-Action model for instruction-based generation and planning. Vega employs the autoregressive paradigm to process visual inputs (vision) and language instructions (language), and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). Joint attention enables interactions across the modalities, while individual projection layers give each modality its own capacity. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.
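To make the joint-attention-with-per-modality-projections idea concrete, here is a minimal PyTorch sketch. All names, dimensions, and token shapes below are illustrative assumptions, not the paper's implementation; the autoregressive decoding and diffusion denoising steps that would consume the fused features are omitted.

```python
import torch
import torch.nn as nn

class JointModalityAttention(nn.Module):
    """Sketch: per-modality projection layers feeding one shared
    multi-head attention over the concatenated token sequence."""

    def __init__(self, dims, d_model=256, n_heads=8):
        super().__init__()
        # One projection layer per modality (vision, language, world, action).
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, d_model) for name, d in dims.items()}
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: dict of modality name -> (batch, seq_len, modality_dim)
        projected = [self.proj[name](x) for name, x in tokens.items()]
        joint = torch.cat(projected, dim=1)        # (batch, total_len, d_model)
        out, _ = self.attn(joint, joint, joint)    # joint attention across modalities
        # Split back into per-modality streams for downstream AR / diffusion heads.
        lengths = [x.shape[1] for x in projected]
        return dict(zip(tokens.keys(), out.split(lengths, dim=1)))

# Toy usage with placeholder shapes (not taken from the paper):
tokens = {
    "vision":   torch.randn(2, 64, 512),   # image patch features
    "language": torch.randn(2, 16, 768),   # instruction embeddings
    "world":    torch.randn(2, 32, 384),   # noised future-prediction latents
    "action":   torch.randn(2, 8, 128),    # noised trajectory latents
}
model = JointModalityAttention({k: v.shape[-1] for k, v in tokens.items()})
fused = model(tokens)
print({k: v.shape for k, v in fused.items()})
```

Keeping a separate projection per modality lets each input stream retain its own width and statistics while the attention itself is shared, which is one common way to couple autoregressive and diffusion branches in a single backbone.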
Community
The shift from scene descriptions to instruction-following is a key evolution for embodied agents. Most VLA papers treat language as a static conditioning signal, but personalized driving requires dynamic instruction interpretation. Curious whether the InstructScene dataset includes multi-turn instruction sequences — e.g., "turn left at the light, then park near the cafe" — or if it's primarily single-step commands. For agentic workflows, the ability to chain instructions over time is critical, and I wonder if the trajectory prediction handles temporal instruction dependencies.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Unifying Language-Action Understanding and Generation for Autonomous Driving (2026)
- Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion (2026)
- DriveWorld-VLA: Unified Latent-Space World Modeling with Vision-Language-Action for Autonomous Driving (2026)
- MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving (2026)
- AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving (2026)
- HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving (2026)
- BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation (2026)