UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment

Overview

UNCHA is a hyperbolic vision-language model that improves part–whole compositional understanding by modeling semantic representativeness as uncertainty.

Unlike conventional vision-language models, UNCHA explicitly captures the fact that:

Not all parts contribute equally to representing a scene
Some regions (e.g., main objects) are more informative than others

To address this, UNCHA introduces uncertainty-aware alignment in hyperbolic space, enabling better hierarchical and compositional reasoning.

Project Page: https://jeeit17.github.io/UNCHA-project_page/
Paper: Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
Code: https://github.com/jeeit17/UNCHA

Download

from huggingface_hub import snapshot_download

# Download every file in the model repository into the local Hugging Face cache
repo_path = snapshot_download("hayeonkim/uncha")
print("Repo downloaded to:", repo_path)

Key Idea

UNCHA models part-to-whole semantic representativeness using uncertainty:

Low uncertainty → highly representative part
High uncertainty → less informative / noisy part

This uncertainty is integrated into:

Contrastive loss → adaptive temperature scaling
Entailment loss → calibrated hierarchical structure with entropy regularization

This leads to improved alignment in hyperbolic embedding space and stronger compositional reasoning.
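
The adaptive temperature idea can be illustrated with a minimal sketch. It is not the paper's formulation: the mapping from uncertainty to temperature (`tau_i = base_temperature * (1 + u_i)`), the function name, and the toy similarity values are all assumptions made for illustration. The sketch only shows the general mechanism, in which a part with higher uncertainty gets a higher temperature and therefore a flatter softmax.

```python
import math

def adaptive_temperature_contrastive_loss(sims, uncertainties, base_temperature=0.07):
    """InfoNCE-style loss in which each row uses its own temperature.

    sims[i][j]: similarity between part i and whole j (matching pairs on the diagonal).
    uncertainties[i]: non-negative uncertainty assigned to part i.
    ASSUMPTION: tau_i = base_temperature * (1 + u_i); the paper may use a different mapping.
    """
    total = 0.0
    for i, (row, u) in enumerate(zip(sims, uncertainties)):
        tau = base_temperature * (1.0 + u)
        logits = [s / tau for s in row]
        # Numerically stable log-sum-exp over the row
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy against the matching (diagonal) entry
        total += log_norm - logits[i]
    return total / len(sims)

# Toy example: two parts, each perfectly matched to its own whole
sims = [[1.0, 0.0], [0.0, 1.0]]
confident = adaptive_temperature_contrastive_loss(sims, [0.0, 0.0], base_temperature=0.1)
uncertain = adaptive_temperature_contrastive_loss(sims, [9.0, 9.0], base_temperature=0.1)
print(confident, uncertain)
```

Under this assumed mapping, the same similarity scores give a softer target distribution for uncertain parts, so they pull on the embedding less sharply than confident ones.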

Model Details

Architecture: Hyperbolic Vision-Language Model
Backbone: ViT-S/16 or ViT-B/16
Training data: GRIT dataset (20.5M pairs, 35.9M part annotations)
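
For readers new to hyperbolic embeddings, the sketch below shows how distances can be computed in the Lorentz (hyperboloid) model, which MERU uses. Whether UNCHA uses exactly this parameterization is an assumption here, since this card does not say; the function names and the example vectors are illustrative only.

```python
import math

def exp_map_origin(v, curvature=1.0):
    """Lift a Euclidean vector v onto the hyperboloid with curvature -curvature.

    Returns (time, space), where time = sqrt(1/c + ||space||^2).
    """
    sqrt_c = math.sqrt(curvature)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        space = [0.0 for _ in v]
    else:
        scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
        space = [scale * x for x in v]
    time = math.sqrt(1.0 / curvature + sum(x * x for x in space))
    return time, space

def lorentz_distance(p, q, curvature=1.0):
    """Geodesic distance between two hyperboloid points given as (time, space)."""
    (pt, ps), (qt, qs) = p, q
    # Lorentzian inner product: <p, q>_L = <p_space, q_space> - p_time * q_time
    inner = sum(a * b for a, b in zip(ps, qs)) - pt * qt
    # Clamp to the domain of acosh to absorb floating-point error
    return math.acosh(max(1.0, -curvature * inner)) / math.sqrt(curvature)

origin = exp_map_origin([0.0, 0.0])
point = exp_map_origin([0.3, 0.4])
print(lorentz_distance(origin, point))
```

A point's distance from the origin equals the Euclidean norm of the vector it was lifted from, which is one reason generic concepts are often placed near the origin and specific ones farther out in hierarchical hyperbolic embeddings.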

Performance

UNCHA scores higher than every listed baseline on every benchmark in the two tables below:

Zero-shot classification (ViT-B/16)

Method	ImageNet	CIFAR-10	CIFAR-100	SUN397	Caltech-101	STL-10
CLIP	40.6	78.9	48.3	43.0	70.7	92.4
MERU	40.1	78.6	49.3	43.0	73.0	92.8
HyCoCLIP	45.8	88.8	60.1	57.2	81.3	95.0
UNCHA (Ours)	48.8	90.4	63.2	57.7	83.9	95.7
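
As a quick summary of the table above, the following snippet averages the six zero-shot accuracies per method. The numbers are copied from the table; the unweighted mean is only a convenience for comparison and is not a metric reported by the authors.

```python
from statistics import mean

# Zero-shot accuracies (ViT-B/16) in the column order of the table:
# ImageNet, CIFAR-10, CIFAR-100, SUN397, Caltech-101, STL-10
zero_shot = {
    "CLIP":     [40.6, 78.9, 48.3, 43.0, 70.7, 92.4],
    "MERU":     [40.1, 78.6, 49.3, 43.0, 73.0, 92.8],
    "HyCoCLIP": [45.8, 88.8, 60.1, 57.2, 81.3, 95.0],
    "UNCHA":    [48.8, 90.4, 63.2, 57.7, 83.9, 95.7],
}

averages = {name: mean(scores) for name, scores in zero_shot.items()}
gain = averages["UNCHA"] - averages["HyCoCLIP"]

for name, avg in averages.items():
    print(f"{name:9s} {avg:.2f}")
print(f"UNCHA - HyCoCLIP: {gain:+.2f}")
```

On these six datasets the unweighted mean rises from 71.37 for HyCoCLIP to 73.28 for UNCHA, a gain of about 1.9 points.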

Multi-object representation (ViT-B/16, mAP)

Method	ComCo 2obj	ComCo 5obj	SimCo 2obj	SimCo 5obj	VOC	COCO
CLIP	77.55	80.22	77.15	88.48	78.56	53.94
HyCoCLIP	72.90	72.90	75.71	82.85	80.43	58.12
UNCHA (Ours)	77.92	81.18	79.72	90.65	82.14	59.43

Training

Training requires preprocessing the GRIT dataset first:

python utils/prepare_GRIT_webdataset.py \
    --raw_webdataset_path datasets/train/GRIT/raw \
    --processed_webdataset_path datasets/train/GRIT/processed

Then run:

./scripts/train.sh \
    --config configs/train_uncha_vit_b.py \
    --num-gpus 4

Evaluation

Zero-shot classification

python scripts/evaluate.py \
    --config configs/eval_zero_shot_classification.py \
    --checkpoint-path /path/to/ckpt

Zero-shot retrieval

python scripts/evaluate.py \
    --config configs/eval_zero_shot_retrieval.py \
    --checkpoint-path /path/to/ckpt

Citation

@inproceedings{kim2026uncha,
  author    = {Kim, Hayeon and Jang, Ji Ha and Kim, Junghun James and Chun, Se Young},
  title     = {UNCHA: Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models},
  booktitle = {CVPR},
  year      = {2026},
}

Acknowledgements

This work is supported by IITP, NRF, MSIT, and Seoul National University programs. We also acknowledge prior works including MERU, HyCoCLIP, and ATMG.

Paper: arXiv:2603.22042 (published Mar 23)