license: mit
pipeline_tag: image-feature-extraction

T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

T-REN (Text-aligned Region Encoder Network) is an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (region tokens). It is built on top of a frozen DINOv3 ViT-L/16 backbone and adds only 3.7% additional parameters.

Compared to patch-based vision-language backbones, T-REN yields stronger dense cross-modal understanding while reducing token counts by more than 24x for images and more than 187x for videos.
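To give a feel for what a 24x reduction means, here is a rough back-of-the-envelope sketch. It counts the patch tokens a ViT with patch size 16 produces for one image, then derives the largest region-token count that would still be an at-least-24x reduction. The 512x512 input size and both helper names are illustrative assumptions, not values or APIs from T-REN; the model card does not state the resolution behind the reported figures.

```python
def patch_token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Number of patch tokens a ViT with this patch size produces for one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def max_region_tokens(num_patch_tokens: int, reduction: float) -> int:
    """Largest token count that is still an at-least-`reduction`-fold reduction."""
    return int(num_patch_tokens // reduction)


# Hypothetical 512x512 input (assumption; not stated in the model card).
patches = patch_token_count(512, 512)     # 32 * 32 patches
regions = max_region_tokens(patches, 24)  # upper bound implied by a 24x reduction
print(f"{patches} patch tokens -> at most {regions} region tokens")
```

Under these assumptions, a 1024-token patch grid would correspond to at most 42 region tokens per image. The actual number of region tokens depends on the image content and on the resolution used, so treat this only as an order-of-magnitude illustration.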

Highlights

Specifically, T-REN delivers:

  • +5.9 mIoU on ADE20K open-vocabulary segmentation.
  • +18.4% recall on COCO object-level text-image retrieval.
  • +15.6% recall on Ego4D video object localization.
  • +17.6% mIoU on VSPW video scene parsing.

Citation

@misc{khosla2026tren,
      title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability}, 
      author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
      year={2026},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}