T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability
Paper: arXiv:2604.18573
T-REN (Text-aligned Region Encoder Network) is an efficient encoder that maps visual input to a compact set of text-aligned, region-level representations (region tokens). It is built on top of a frozen DINOv3 ViT-L/16 backbone, adding only 3.7% more parameters.
Compared to patch-based vision-language backbones, T-REN yields stronger dense cross-modal understanding while reducing token counts by more than 24x for images and 187x for videos.
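To make the token-reduction idea concrete, here is a minimal NumPy sketch of the general pattern the abstract describes: a frozen backbone produces per-patch features, and a small trainable head pools them into a handful of region tokens projected into a text-aligned space. The pooling mechanism (learned queries with cross-attention), the dimensions, and all variable names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the paper)
num_patches = 196   # e.g. a 14x14 patch grid from a ViT-/16 backbone at 224px
d_backbone = 1024   # ViT-L feature width
num_regions = 8     # compact region-token set (>24x fewer than the patch grid)
d_text = 768        # assumed width of the shared text-aligned embedding space

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-in for frozen backbone output: one feature vector per image patch
patch_feats = rng.standard_normal((num_patches, d_backbone))

# Small learned head (the only trainable part in this sketch):
# a set of region queries plus a projection into the text-aligned space.
region_queries = rng.standard_normal((num_regions, d_backbone))
W_proj = rng.standard_normal((d_backbone, d_text)) / np.sqrt(d_backbone)

# Cross-attention pooling: each query attends over all patches,
# yielding one region token per query.
attn = softmax(region_queries @ patch_feats.T / np.sqrt(d_backbone), axis=-1)
region_tokens = attn @ patch_feats      # (num_regions, d_backbone)
region_tokens = region_tokens @ W_proj  # (num_regions, d_text)

print(region_tokens.shape)              # (8, 768)
print(num_patches / num_regions)        # 24.5 — matches the ">24x" reduction scale
```

Downstream vision-language layers would then consume the 8 region tokens instead of all 196 patch tokens, which is where the efficiency gain comes from.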
@misc{khosla2026tren,
      title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability},
      author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
      year={2026},
      eprint={2604.18573},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}