T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability
Paper: arXiv:2604.18573
T-REN (Text-aligned Region Encoder Network) is an efficient encoder that maps visual input to a compact set of text-aligned, region-level representations (region tokens). It is built on top of a frozen DINOv3 ViT-L/16 backbone, adding only 3.7% more parameters.
Compared to patch-based vision-language backbones, T-REN yields stronger dense cross-modal understanding while reducing token counts by more than 24x for images and 187x for videos.
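To make the token-reduction idea concrete, here is a minimal NumPy sketch of the general pattern the abstract describes: a frozen backbone produces per-patch features, and a small trainable head pools them into a handful of region tokens projected into a text-aligned space. The pooling mechanism (learned queries with cross-attention), the dimensions, and all variable names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper)
num_patches = 196   # e.g. a 14x14 patch grid from a ViT-/16 backbone at 224px
d_backbone = 1024   # ViT-L feature width
num_regions = 8     # compact region-token set (>24x fewer than the patch grid)
d_text = 768        # assumed width of the shared text-aligned embedding space

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in for frozen backbone output: one feature vector per image patch
patch_feats = rng.standard_normal((num_patches, d_backbone))

# Small learned head (the only trainable part in this sketch):
# a set of region queries plus a projection into the text-aligned space.
region_queries = rng.standard_normal((num_regions, d_backbone))
W_proj = rng.standard_normal((d_backbone, d_text)) / np.sqrt(d_backbone)

# Cross-attention pooling: each query attends over all patches,
# yielding one region token per query.
attn = softmax(region_queries @ patch_feats.T / np.sqrt(d_backbone), axis=-1)
region_tokens = attn @ patch_feats      # (num_regions, d_backbone)
region_tokens = region_tokens @ W_proj  # (num_regions, d_text)

print(region_tokens.shape)              # (8, 768)
print(num_patches / num_regions)        # 24.5 — matches the ">24x" reduction scale
```

Downstream vision-language layers would then consume the 8 region tokens instead of all 196 patch tokens, which is where the efficiency gain comes from.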
@misc{khosla2026tren,
      title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability},
      author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
      year={2026},
      eprint={2604.18573},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}