license: mit
pipeline_tag: image-feature-extraction

T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

T-REN (Text-aligned Region Encoder Network) is an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (region tokens). It is built on top of a frozen DINOv3 ViT-L/16 backbone and adds only 3.7% additional parameters.
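To give a feel for what the 3.7% overhead means in absolute terms, here is a rough back-of-the-envelope calculation. It assumes a backbone of about 300M parameters (an approximate figure that is not stated in this card) and reads "3.7% additional" as relative to the backbone; both are assumptions made only for illustration.

```python
# Assumption: approximate backbone size, not taken from this model card.
BACKBONE_PARAMS = 300_000_000
# "3.7% additional parameters", interpreted here as a fraction of the backbone.
ADDED_FRACTION = 0.037

# Parameters added on top of the frozen backbone under these assumptions.
added_params = BACKBONE_PARAMS * ADDED_FRACTION
total_params = BACKBONE_PARAMS + added_params

print(f"added: ~{added_params / 1e6:.1f}M parameters")
print(f"total: ~{total_params / 1e6:.1f}M parameters")
```

Under these assumptions the added parameters are on the order of ten million, and since the backbone is frozen, they would be the only weights updated during training.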

Compared to patch-based vision-language backbones, T-REN yields stronger dense cross-modal understanding while reducing token counts by more than 24x for images and 187x for videos.
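The image-side reduction can be illustrated with simple arithmetic. The sketch below counts the patch tokens a patch-size-16 ViT produces for one image and divides by the 24x factor quoted above. The 512x512 resolution and the helper names are hypothetical; the card does not say at which resolution the factor was measured, so the result is only indicative.

```python
def patch_token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patch tokens for one image."""
    return (height // patch_size) * (width // patch_size)


def token_budget(n_patch_tokens: int, reduction_factor: float) -> float:
    """Token count left after dividing by a reduction factor."""
    return n_patch_tokens / reduction_factor


# Hypothetical input resolution, chosen only for illustration.
patch_tokens = patch_token_count(512, 512)

# 24x is the image reduction factor quoted in this card ("more than 24x").
image_budget = token_budget(patch_tokens, 24)

print(f"patch tokens: {patch_tokens}")
print(f"tokens at a 24x reduction: about {image_budget:.1f}")
```

Because the card states "more than 24x", the value printed here is an upper bound on the region-token count for this hypothetical resolution, not a measured number.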

Paper: T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability
GitHub Repository: savya08/T-REN

Highlights

Relative to patch-based vision-language backbones, T-REN delivers:

+5.9 mIoU on ADE20K open-vocabulary segmentation.
+18.4% recall on COCO object-level text-image retrieval.
+15.6% recall on Ego4D video object localization.
+17.6% mIoU on VSPW video scene parsing.

Citation

@misc{khosla2026tren,
      title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability}, 
      author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
      year={2026},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}