Abstract
RynnBrain is an open-source spatiotemporal foundation model for embodied intelligence that unifies perception, reasoning, and planning capabilities across multiple scales and task-specific variants.
Despite rapid progress in multimodal foundation models, the embodied intelligence community still lacks a unified, physically grounded foundation model that integrates perception, reasoning, and planning within real-world spatiotemporal dynamics. We introduce RynnBrain, an open-source spatiotemporal foundation model for embodied intelligence. RynnBrain strengthens four core capabilities in a unified framework: comprehensive egocentric understanding, diverse spatiotemporal localization, physically grounded reasoning, and physics-aware planning. The RynnBrain family comprises three foundation model scales (2B, 8B, and 30B-A3B MoE) and four post-trained variants tailored for downstream embodied tasks (i.e., RynnBrain-Nav, RynnBrain-Plan, and RynnBrain-VLA) or complex spatial reasoning tasks (i.e., RynnBrain-CoP). In extensive evaluations on 20 embodied benchmarks and 8 general vision understanding benchmarks, our RynnBrain foundation models outperform existing embodied foundation models by a significant margin. The post-trained model suite further substantiates two key potentials of the RynnBrain foundation model: (i) enabling physically grounded reasoning and planning, and (ii) serving as a strong pretrained backbone that can be efficiently adapted to diverse embodied tasks.
Community
🚀 We’re excited to release our paper and fully open-source RynnBrain — an embodied foundation model designed as a unified cognitive brain for real-world agents.
Unlike conventional VLMs that reason purely in text or static images, RynnBrain is explicitly grounded in physical space and time, integrating egocentric perception, spatiotemporal memory, physically grounded reasoning, and physics-aware planning in a single model.
🧠 What’s fundamentally new?
RynnBrain introduces a spatiotemporal foundation model for embodied intelligence, where reasoning is no longer detached from the physical world:
• Agents can remember object locations across time, not just within a single frame
• Reasoning is interleaved with spatial grounding (text ⇄ coordinates), reducing hallucination
• Planning outputs are directly executable, with objects, areas, affordances, and trajectories grounded in space (a minimal sketch of such a grounded plan step follows below)
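To make "directly executable" concrete, here is a minimal sketch of what a spatially grounded plan step could look like. The field names, the normalized [0, 1] coordinate convention, and the schema as a whole are illustrative assumptions on our part, not RynnBrain's actual output format:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative only: field names and the normalized [0, 1] image-coordinate
# convention are assumptions, not RynnBrain's actual plan schema.
Point = Tuple[float, float]  # (x, y) in normalized image coordinates

@dataclass
class GroundedPlanStep:
    instruction: str                                # natural-language sub-goal
    target_object: str                              # object label the step acts on
    object_box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) grounding box
    affordance_point: Point                         # where to grasp / press / place
    trajectory: List[Point] = field(default_factory=list)  # waypoints toward the goal

# The kind of plan step a downstream controller could consume directly:
step = GroundedPlanStep(
    instruction="pick up the mug on the counter",
    target_object="mug",
    object_box=(0.42, 0.55, 0.58, 0.71),
    affordance_point=(0.50, 0.63),
    trajectory=[(0.20, 0.80), (0.35, 0.72), (0.50, 0.63)],
)
```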
📈 Scale & training
We pretrain RynnBrain on ~20M high-quality embodied training pairs, spanning object cognition, spatial reasoning, grounding, trajectory prediction, and manipulation planning.
To make this feasible, we introduce RynnScale, a load-balanced spatiotemporal training framework that improves training efficiency by ~2× under the same compute budget, while preserving stability across dense and MoE models.
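We leave RynnScale's internals to the paper; as a rough intuition for what load balancing means here, the sketch below uses a greedy longest-first heuristic to spread samples with very different token costs (short image-QA pairs vs. long egocentric video clips) across ranks so that no rank stalls on the heaviest sequences. This is an illustrative heuristic of ours, not RynnScale's actual scheduler:

```python
import heapq
from typing import Dict, List

def balance_across_ranks(samples: List[Dict], num_ranks: int) -> List[List[Dict]]:
    """Greedy longest-processing-time assignment of samples to ranks.

    Each sample carries a 'tokens' cost (e.g. a short image-QA pair vs. a long
    egocentric video clip); spreading these costs evenly keeps every rank busy.
    Illustrative only; not the actual RynnScale scheduler.
    """
    buckets = [(0, rank, []) for rank in range(num_ranks)]  # (total_tokens, rank_id, assigned)
    heapq.heapify(buckets)
    for sample in sorted(samples, key=lambda s: s["tokens"], reverse=True):
        total, rank, assigned = heapq.heappop(buckets)      # least-loaded rank so far
        assigned.append(sample)
        heapq.heappush(buckets, (total + sample["tokens"], rank, assigned))
    return [assigned for _, _, assigned in sorted(buckets, key=lambda b: b[1])]

# Example: a mix of short image samples and long video samples over 4 ranks.
mixed = [{"id": i, "tokens": t} for i, t in enumerate([8192, 512, 4096, 640, 256, 7168, 1024, 2048])]
per_rank = balance_across_ranks(mixed, num_ranks=4)
```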
🏆 Strong empirical results
Across 20 embodied benchmarks + 8 general vision benchmarks, RynnBrain consistently outperforms existing embodied foundation models:
• Large gains in spatial reasoning, egocentric cognition, and fine-grained localization
• Performance competitive with or superior to strong proprietary systems (e.g., Gemini Robotics ER-style models) under comparable settings
Post-trained variants further validate its versatility:
• RynnBrain-CoP: physically grounded chain-of-point reasoning (a toy trace format is sketched after this list)
• RynnBrain-Nav: SOTA results on VLN benchmarks (R2R, RxR)
• RynnBrain-Plan: spatially explicit manipulation planning
• RynnBrain-VLA: stronger downstream vision-language-action execution
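As a rough illustration of what chain-of-point reasoning that interleaves text with coordinates could look like, the toy trace below embeds grounded points directly in the reasoning text and then extracts them. The `<point x=... y=...>` tag syntax is an assumption we made for illustration, not RynnBrain-CoP's actual serialization:

```python
import re
from typing import List, Tuple

# Hypothetical trace: reasoning text interleaved with grounded points.
# The <point x=... y=...> tag syntax is an illustrative assumption, not the
# actual RynnBrain-CoP output format.
trace = (
    "The mug was last seen on the counter <point x=0.51 y=0.62>; "
    "the robot is now near the sink <point x=0.20 y=0.75>, so it should "
    "move right and forward before grasping <point x=0.50 y=0.63>."
)

def extract_points(text: str) -> List[Tuple[float, float]]:
    """Collect every grounded coordinate referenced during reasoning."""
    return [(float(x), float(y))
            for x, y in re.findall(r"<point x=([\d.]+) y=([\d.]+)>", text)]

print(extract_points(trace))  # [(0.51, 0.62), (0.2, 0.75), (0.5, 0.63)]
```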
📊 We also introduce RynnBrain-Bench
Existing benchmarks fail to evaluate long-horizon spatiotemporal grounding.
RynnBrain-Bench fills this gap with 21 fine-grained embodied capabilities, covering object cognition, spatial cognition, grounding, and pointing across full episodic memory.
📦 Fully open-source release
• Model code & checkpoints (2B, 8B, MoE 30B-A3B)
• Complete training & fine-tuning framework
• RynnBrain-Bench benchmark suite
• Recipes for navigation, planning, and action workflows
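For anyone who wants to try the checkpoints right away, here is a minimal loading sketch with Hugging Face transformers. The repo id below is a placeholder, and the processor class and `trust_remote_code` requirement are assumptions; check the official release page for the real model ids and usage instructions:

```python
# Minimal loading sketch with Hugging Face transformers.
# The repo id is a placeholder and trust_remote_code is an assumption;
# see the official release page for the real model ids and instructions.
from transformers import AutoModelForCausalLM, AutoProcessor

repo_id = "RynnBrain/RynnBrain-8B"  # placeholder, not a confirmed model id

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",      # pick up the checkpoint's native precision
    device_map="auto",       # shard across available GPUs (requires accelerate)
    trust_remote_code=True,  # custom multimodal code, if the release uses it
)
```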
🔮 Why this matters
Embodied intelligence needs more than language fluency — it needs memory, spatial grounding, and physical consistency.
RynnBrain is a step toward physically grounded general intelligence, offering a reproducible, extensible foundation for agents that perceive, remember, reason, and act in the real world.
arXivLens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/rynnbrain-open-embodied-foundation-models-821-0556da0b