---
license: mit
pipeline_tag: image-feature-extraction
---
# T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

T-REN (**T**ext-aligned **R**egion **E**ncoder **N**etwork) is an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (region tokens). It is built on top of a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-L/16 backbone and adds only 3.7% more parameters.

Compared to patch-based vision-language backbones, T-REN yields stronger dense cross-modal understanding while reducing token counts by more than 24x for images and 187x for videos.

- **Paper:** [T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability](https://huggingface.co/papers/2604.18573)
- **GitHub Repository:** [savya08/T-REN](https://github.com/savya08/T-REN)
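
To get a feel for the token reduction quoted above, the following back-of-the-envelope sketch counts the patch tokens a ViT with 16x16 patches produces and the region-token budget implied by the reported ratios. The input resolution and clip length are hypothetical, and the helper names `patch_tokens` and `max_region_tokens` are made up for this illustration. The reported ratios come from the paper's benchmarks, so applying them to an arbitrary input is only a rough estimate.

```python
PATCH_SIZE = 16        # from "ViT-L/16"
IMAGE_REDUCTION = 24   # "more than 24x" for images
VIDEO_REDUCTION = 187  # "more than 187x" for videos


def patch_tokens(height, width, frames=1, patch=PATCH_SIZE):
    """Number of patch tokens a ViT emits (ignoring CLS/register tokens)."""
    return frames * (height // patch) * (width // patch)


def max_region_tokens(n_patch_tokens, reduction):
    """Upper bound on region tokens implied by an at-least-`reduction`x cut."""
    return n_patch_tokens // reduction


# Hypothetical inputs: one 512x512 image and a 16-frame clip at the same size.
image_patches = patch_tokens(512, 512)
video_patches = patch_tokens(512, 512, frames=16)

print("image:", image_patches, "patch tokens ->",
      max_region_tokens(image_patches, IMAGE_REDUCTION), "region tokens at most")
print("video:", video_patches, "patch tokens ->",
      max_region_tokens(video_patches, VIDEO_REDUCTION), "region tokens at most")
```

Under these assumptions, a 512x512 image goes from 1024 patch tokens to at most 42 region tokens, and the 16-frame clip from 16384 patch tokens to at most 87.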
## Highlights

Relative to patch-based vision-language backbones, T-REN delivers:

- **+5.9 mIoU** on ADE20K open-vocabulary segmentation.
- **+18.4% recall** on COCO object-level text-image retrieval.
- **+15.6% recall** on Ego4D video object localization.
- **+17.6% mIoU** on VSPW video scene parsing.
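
As an illustration of what text-aligned region tokens make possible in tasks such as open-vocabulary segmentation, the sketch below labels each region by comparing its token with text embeddings of candidate class names. This is a generic cosine-similarity matching scheme, assumed here for illustration. It is not T-REN's API or its confirmed inference procedure. The function names and the toy 3-dimensional embeddings are hypothetical stand-ins for real encoder outputs, and in practice the same logic would run on tensors in a library such as PyTorch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def label_regions(region_tokens, text_embeddings):
    """Give each region token the class whose text embedding is most similar."""
    labels = []
    for token in region_tokens:
        scores = {name: cosine(token, emb) for name, emb in text_embeddings.items()}
        labels.append(max(scores, key=scores.get))
    return labels


# Toy stand-ins: one text embedding per class name, one token per region.
text_embeddings = {
    "sky": [1.0, 0.0, 0.0],
    "road": [0.0, 1.0, 0.0],
    "tree": [0.0, 0.0, 1.0],
}
region_tokens = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.8],
    [0.0, 2.0, 0.1],
]

print(label_regions(region_tokens, text_embeddings))  # ['sky', 'tree', 'road']
```

Because the comparison happens once per region rather than once per patch, the cost of this matching step scales with the number of region tokens.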
## Citation

```bibtex
@misc{khosla2026tren,
  title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability},
  author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
  year={2026},
  eprint={2604.18573},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
}
```