Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

This is the model checkpoint for GeoVR, a paradigm that restructures an MLLM’s intrinsic representations with geometric awareness, using purely 2D videos, to support spatial intelligence.

GitHub: https://github.com/WHB139426/GeoVR-MLLM
Paper: https://arxiv.org/abs/2606.05833

Citation

If you find this work useful, please consider citing:

@article{wang2026learning,
  title={Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models},
  author={Wang, Haibo and Huang, Lifu},
  journal={arXiv preprint arXiv:2606.05833},
  year={2026}
}

