Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models
Paper: arXiv:2606.05833
This is the model checkpoint for GeoVR, a paradigm that restructures an MLLM’s intrinsic representations with geometric awareness, using only 2D videos, for spatial intelligence.
If you find this work useful, please consider citing:
@article{wang2026learning,
title={Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models},
author={Wang, Haibo and Huang, Lifu},
journal={arXiv preprint arXiv:2606.05833},
year={2026}
}