Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

This is the model checkpoint for GeoVR, a paradigm that restructures an MLLM's intrinsic representations with geometric awareness, using only 2D videos, to support spatial intelligence.
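This card does not document a loading API, so the following is only a minimal sketch of fetching the checkpoint files locally. It assumes the `huggingface_hub` package is installed and that the files live in the `WHB139426/GeoVR` repository; the helper name `download_checkpoint` and the default target directory are hypothetical choices made for illustration.

```python
from pathlib import Path

# Repository id as it appears on the model page.
REPO_ID = "WHB139426/GeoVR"


def download_checkpoint(local_dir: str = "checkpoints/GeoVR") -> Path:
    """Download all files in the model repository and return the local folder.

    `local_dir` is a hypothetical default, not a path required by the authors.
    """
    # Imported lazily so the module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))


if __name__ == "__main__":
    folder = download_checkpoint()
    # List what was fetched; how the weights are loaded depends on the
    # authors' code, which this card does not describe.
    for item in sorted(folder.iterdir()):
        print(item.name)
```

The sketch stops at downloading because the model class and preprocessing needed to run inference are not stated here.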

Citation

If you find this work useful, please consider citing:

@article{wang2026learning,
  title={Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models},
  author={Wang, Haibo and Huang, Lifu},
  journal={arXiv preprint arXiv:2606.05833},
  year={2026}
}