pipeline_tag: image-feature-extraction

UniScene3D: Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding

UniScene3D is a transformer-based encoder that learns unified scene representations from multi-view colored pointmaps, jointly modeling image appearance and geometry. It extends pretrained CLIP models to learn representations that effectively combine complementary information from images and pointmaps, generalizing across diverse 3D scene understanding tasks.
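To make the input format concrete, here is a minimal sketch of how a single colored pointmap could be built from a depth image and its RGB image. It assumes a pinhole camera and keeps points in the camera frame. The function name `colored_pointmap` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are illustrative and not taken from the UniScene3D code. A multi-view setup would presumably also use camera poses to move each view's points into a shared frame, which is omitted here.

```python
def colored_pointmap(depth, rgb, fx, fy, cx, cy):
    """Back-project a depth image into a per-pixel colored pointmap.

    depth: H x W nested list of depths along the camera z-axis.
    rgb:   H x W nested list of (r, g, b) tuples.
    fx, fy, cx, cy: assumed pinhole intrinsics (focal lengths, principal point).
    Returns an H x W nested list of (x, y, z, r, g, b) tuples in the camera frame.
    """
    pointmap = []
    for v, (depth_row, rgb_row) in enumerate(zip(depth, rgb)):
        row = []
        for u, (z, color) in enumerate(zip(depth_row, rgb_row)):
            # Pinhole back-projection: pixel (u, v) at depth z -> 3D point.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            # Geometry (x, y, z) and appearance (r, g, b) share one pixel slot.
            row.append((x, y, z, *color))
        pointmap.append(row)
    return pointmap
```

Each pixel then carries six channels, so an image-style encoder such as a ViT can patchify the pointmap much as it would an ordinary image.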

Key Features

  • Unified Representation: Jointly encodes geometry and appearance from multi-view colored pointmaps within a single ViT encoder.
  • Novel Training Objectives: Introduces cross-view geometric alignment and grounded view alignment to enforce geometric and semantic consistency.
  • Versatile Performance: Reports state-of-the-art results in zero-shot, few-shot, and task-specific fine-tuning settings on tasks such as viewpoint grounding, scene retrieval, and 3D VQA.
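The contrastive language alignment named in the title can be illustrated with a generic CLIP-style symmetric InfoNCE loss between scene embeddings and text embeddings. This is a sketch of the general technique only. It is not the authors' cross-view geometric alignment or grounded view alignment objective, whose details are not given in this card. The function name `clip_style_loss` and the default temperature of 0.07 are assumptions.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_style_loss(scene_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    scene_emb[i] and text_emb[i] are assumed to describe the same scene;
    every other pairing in the batch is treated as a negative.
    """
    scene = [_normalize(v) for v in scene_emb]
    text = [_normalize(v) for v in text_emb]
    n = len(scene)

    # Cosine similarities scaled by the temperature: logits[i][j] = s_i . t_j / T.
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in text]
        for s in scene
    ]
    logits_t = [list(col) for col in zip(*logits)]

    # Cross-entropy with the matching index as the target, in both directions.
    scene_to_text = -sum(_log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_scene = -sum(_log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (scene_to_text + text_to_scene)
```

The loss is small when each scene embedding is closest to its own caption and large when pairs are mismatched, which is the behaviour that zero-shot retrieval relies on.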

Citation

If you find this work useful, please cite:

@inproceedings{mao2026uniscene3d,
  title     = {Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding},
  author    = {Mao, Ye and Luo, Weixun and Huang, Ranran and Jing, Junpeng and Mikolajczyk, Krystian},
  booktitle = {arxiv},
  year      = {2026}
}