UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment

Overview

UNCHA is a hyperbolic vision-language model that improves part–whole compositional understanding by modeling semantic representativeness as uncertainty.

Unlike conventional vision-language models, UNCHA explicitly captures the fact that:

  • Not all parts contribute equally to representing a scene
  • Some regions (e.g., main objects) are more informative than others

To address this, UNCHA introduces uncertainty-aware alignment in hyperbolic space, enabling better hierarchical and compositional reasoning.


Download

from huggingface_hub import snapshot_download

repo_path = snapshot_download("hayeonkim/uncha")

print("Repo downloaded to:", repo_path)
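
The layout of the downloaded snapshot is not described here, so one quick way to locate the weights is to walk the directory and filter by file extension. The sketch below is only illustrative: the helper `find_checkpoints` and the extension list are assumptions, not part of the UNCHA repository, and the actual checkpoint file names may differ.

```python
from pathlib import Path

# Assumed checkpoint extensions; the real file names in the repo may differ.
CHECKPOINT_SUFFIXES = {".pth", ".pt", ".ckpt", ".safetensors", ".bin"}


def find_checkpoints(repo_path):
    """Return sorted paths under repo_path whose suffix looks like a checkpoint."""
    root = Path(repo_path)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in CHECKPOINT_SUFFIXES
    )
```

Calling `find_checkpoints(repo_path)` on the path returned by `snapshot_download` gives candidate files to pass as the checkpoint path during evaluation.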

Key Idea

UNCHA models part-to-whole semantic representativeness using uncertainty:

  • Low uncertainty → highly representative part
  • High uncertainty → less informative / noisy part

This uncertainty is integrated into:

  • Contrastive loss → adaptive temperature scaling
  • Entailment loss → calibrated hierarchical structure with entropy regularization

This leads to improved alignment in hyperbolic embedding space and stronger compositional reasoning.
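
The exact form of the uncertainty-weighted losses is not given here. One plausible reading of "adaptive temperature scaling" is that each part's logits are divided by a temperature that grows with its uncertainty, so a noisy part produces a flatter distribution over candidate wholes. The sketch below assumes that reading; `adaptive_temperature`, `base_temp`, and the linear scaling rule are hypothetical choices for illustration, not the paper's formulation.

```python
import math


def adaptive_temperature(uncertainty, base_temp=0.07):
    """Hypothetical rule: temperature grows linearly with uncertainty (>= 0)."""
    return base_temp * (1.0 + uncertainty)


def info_nce(similarities, positive_index, temperature):
    """Cross-entropy of the positive pair under a temperature-scaled softmax."""
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_norm - logits[positive_index]


# Similarities of one part embedding to three candidate wholes; index 0 is the match.
sims = [0.9, 0.2, 0.1]
loss_confident = info_nce(sims, 0, adaptive_temperature(0.0))
loss_uncertain = info_nce(sims, 0, adaptive_temperature(2.0))
```

In this toy case the positive already has the highest similarity, so the flatter, high-uncertainty softmax yields the larger loss value.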


Model Details

  • Architecture: Hyperbolic Vision-Language Model
  • Backbone: ViT-S/16 or ViT-B/16
  • Training data: GRIT dataset (20.5M pairs, 35.9M part annotations)
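
For readers new to hyperbolic embeddings, the sketch below shows the two basic operations such models rely on: lifting a Euclidean feature onto the hyperboloid and measuring geodesic distance. It assumes the Lorentz (hyperboloid) model used by MERU and HyCoCLIP; whether UNCHA uses exactly this parameterization is an assumption, and the function names and the `curvature` argument are illustrative.

```python
import math


def exp_map_origin(v, curvature=1.0):
    """Map a tangent vector at the origin onto the hyperboloid.

    Points satisfy -time^2 + |space|^2 = -1 / curvature.
    Returns a (time, space) pair.
    """
    sqrt_c = math.sqrt(curvature)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return 1.0 / sqrt_c, [0.0] * len(v)
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    space = [scale * x for x in v]
    time = math.cosh(sqrt_c * norm) / sqrt_c
    return time, space


def lorentz_distance(p, q, curvature=1.0):
    """Geodesic distance between two hyperboloid points given as (time, space)."""
    (p_time, p_space), (q_time, q_space) = p, q
    # Lorentzian inner product: space part minus time part.
    inner = sum(a * b for a, b in zip(p_space, q_space)) - p_time * q_time
    # Clamp to the domain of acosh to absorb floating-point error.
    return math.acosh(max(1.0, -curvature * inner)) / math.sqrt(curvature)
```

A useful sanity check is that the distance from the origin to `exp_map_origin(v)` equals the Euclidean norm of `v`, so features with larger norms land farther from the origin.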

Performance

UNCHA scores higher than every listed baseline on each benchmark in the two tables below:

Zero-shot classification (ViT-B/16)

Method        ImageNet  CIFAR-10  CIFAR-100  SUN397  Caltech-101  STL-10
CLIP              40.6      78.9       48.3    43.0         70.7    92.4
MERU              40.1      78.6       49.3    43.0         73.0    92.8
HyCoCLIP          45.8      88.8       60.1    57.2         81.3    95.0
UNCHA (Ours)      48.8      90.4       63.2    57.7         83.9    95.7
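
To summarize the zero-shot results in a single number, the six per-dataset scores can be averaged per method. The snippet below copies the values from the table above; the unweighted mean is a convenience summary chosen here, not a metric reported in the source.

```python
# Zero-shot scores copied from the ViT-B/16 table above, in column order:
# ImageNet, CIFAR-10, CIFAR-100, SUN397, Caltech-101, STL-10.
ZERO_SHOT = {
    "CLIP": [40.6, 78.9, 48.3, 43.0, 70.7, 92.4],
    "MERU": [40.1, 78.6, 49.3, 43.0, 73.0, 92.8],
    "HyCoCLIP": [45.8, 88.8, 60.1, 57.2, 81.3, 95.0],
    "UNCHA": [48.8, 90.4, 63.2, 57.7, 83.9, 95.7],
}


def mean_score(scores):
    """Unweighted mean over datasets."""
    return sum(scores) / len(scores)


averages = {name: mean_score(vals) for name, vals in ZERO_SHOT.items()}
gain_over_hycoclip = averages["UNCHA"] - averages["HyCoCLIP"]

for name, avg in averages.items():
    print(f"{name:10s} {avg:.2f}")
print(f"UNCHA vs HyCoCLIP: +{gain_over_hycoclip:.2f} points")
```

On these numbers UNCHA averages about 73.3 against about 71.4 for HyCoCLIP, a gain of roughly 1.9 points.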

Multi-object representation (ViT-B/16, mAP)

Method        ComCo 2obj  ComCo 5obj  SimCo 2obj  SimCo 5obj    VOC   COCO
CLIP               77.55       80.22       77.15       88.48  78.56  53.94
HyCoCLIP           72.90       72.90       75.71       82.85  80.43  58.12
UNCHA (Ours)       77.92       81.18       79.72       90.65  82.14  59.43

Training

Training requires preprocessing the GRIT dataset first:

python utils/prepare_GRIT_webdataset.py \
    --raw_webdataset_path datasets/train/GRIT/raw \
    --processed_webdataset_path datasets/train/GRIT/processed

Then run:

./scripts/train.sh \
    --config configs/train_uncha_vit_b.py \
    --num-gpus 4

Evaluation

Zero-shot classification

python scripts/evaluate.py \
    --config configs/eval_zero_shot_classification.py \
    --checkpoint-path /path/to/ckpt

Retrieval

python scripts/evaluate.py \
    --config configs/eval_zero_shot_retrieval.py \
    --checkpoint-path /path/to/ckpt
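
Both evaluations take the same checkpoint, so they can be chained from one small driver. This is a convenience sketch, not a script shipped with the repository: `build_eval_commands` and `run_all` are hypothetical helpers that only reuse the script path, config paths, and flags shown above, and they assume they are run from the repository root.

```python
import subprocess

# Config paths taken from the two commands above.
EVAL_CONFIGS = [
    "configs/eval_zero_shot_classification.py",
    "configs/eval_zero_shot_retrieval.py",
]


def build_eval_commands(checkpoint_path):
    """Return one argument list per evaluation config."""
    return [
        ["python", "scripts/evaluate.py",
         "--config", config,
         "--checkpoint-path", checkpoint_path]
        for config in EVAL_CONFIGS
    ]


def run_all(checkpoint_path):
    """Run each evaluation in turn, stopping at the first failure."""
    for cmd in build_eval_commands(checkpoint_path):
        subprocess.run(cmd, check=True)
```

For example, `run_all("/path/to/ckpt")` runs classification first and then retrieval.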

Citation

@inproceedings{kim2026uncha,
  author    = {Kim, Hayeon and Jang, Ji Ha and Kim, Junghun James and Chun, Se Young},
  title     = {UNCHA: Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models},
  booktitle = {CVPR},
  year      = {2026},
}

Acknowledgements

This work is supported by IITP, NRF, MSIT, and Seoul National University programs. We also acknowledge prior works including MERU, HyCoCLIP, and ATMG.
