Overview
ACE-Brain-0 is a generalist multimodal foundation model designed to unify perception, reasoning, and decision-making across diverse embodied domains, including spatial cognition, autonomous driving, low-altitude sensing, and embodied interaction. Built on a unified multimodal large language model (MLLM) architecture, ACE-Brain-0 learns a shared spatial-reasoning substrate that enables generalization across heterogeneous physical environments and agent embodiments.
Extensive evaluation across 24 benchmarks demonstrates that ACE-Brain achieves state-of-the-art or competitive performance across multiple domains, validating its effectiveness as a unified embodied intelligence model.
Key Features
- Unified multimodal foundation model for embodied intelligence
- Strong spatial reasoning as a universal intelligence scaffold
- Supports diverse embodiment platforms:
  - Spatial Cognition
  - Autonomous Driving
  - Low-Altitude Sensing
  - Embodied Interaction
- Cross-domain generalization across perception, reasoning, and planning
Performance Highlights
ACE-Brain-0 achieves strong performance across 24 benchmarks covering Spatial Cognition, Autonomous Driving, Low-Altitude Sensing, and Embodied Interaction, consistently outperforming existing open-source embodied VLMs and remaining competitive with closed-source models.
The model shows robust capability in spatial reasoning, physical interaction understanding, task-oriented decision-making, and dynamic scene interpretation, enabling reliable performance across diverse real-world embodiment scenarios.
In the driving and aerial domains, ACE-Brain-0 demonstrates excellent performance in environment understanding, motion reasoning, and planning-aware prediction, highlighting its effectiveness in complex, large-scale, and safety-critical environments.
Despite its embodied-domain specialization, ACE-Brain-0 maintains strong general multimodal reasoning ability, confirming that spatial-intelligence-centered training enhances overall vision-language capability rather than limiting generalization.
Spatial Benchmarks
Autonomous Driving Benchmarks
Low-Altitude Benchmarks
Embodied Benchmarks
Bold numbers indicate the best results, underlined numbers indicate the second-best results, and results marked with * are obtained using our evaluation framework.
Inference Example
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# Default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "ACE-Brain/ACE-Brain-0-8B", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("ACE-Brain/ACE-Brain-0-8B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from each sequence
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
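The same chat-template message schema accepts multiple images in a single user turn, which is convenient for multi-view spatial queries (e.g. several camera feeds from a vehicle). A minimal sketch of assembling such a payload; the helper name, image paths, and question below are illustrative and not part of the library:

```python
def build_spatial_query(image_sources, question):
    """Build a single-turn chat message with several images followed by a
    text question, matching the processor's chat-template content schema
    (a list of {"type": "image", ...} entries, then a {"type": "text", ...} entry).
    """
    content = [{"type": "image", "image": src} for src in image_sources]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


# Hypothetical multi-view query; pass the result to processor.apply_chat_template
# exactly as in the single-image example above.
messages = build_spatial_query(
    ["front_cam.jpg", "rear_cam.jpg"],
    "Which camera view shows the pedestrian closest to the ego vehicle?",
)
```

The `"image"` field may be a local path or a URL, as in the single-image example.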
Citation
@misc{gong2026acebrain0spatialintelligenceshared,
  title={ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments},
  author={Ziyang Gong and Zehang Luo and Anke Tang and Zhe Liu and Shi Fu and Zhi Hou and Ganlin Yang and Weiyun Wang and Xiaofeng Wang and Jianbo Liu and Gen Luo and Haolan Kang and Shuang Luo and Yue Zhou and Yong Luo and Li Shen and Xiaosong Jia and Yao Mu and Xue Yang and Chunxiao Liu and Junchi Yan and Hengshuang Zhao and Dacheng Tao and Xiaogang Wang},
  year={2026},
  eprint={2603.03198},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2603.03198},
}