	---
	license: mit
	pipeline_tag: image-feature-extraction
	---

	# T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

	T-REN (Text-aligned Region Encoder Network) is an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (region tokens). It is built on top of a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-L/16 backbone and adds only 3.7% additional parameters.
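
To put the 3.7% figure in context, here is a back-of-the-envelope sketch. The backbone size of roughly 300M parameters is an assumption (a commonly quoted approximate size for ViT-L models), not a number stated in this card, and `added_parameters` is a hypothetical helper name. Substitute the exact backbone count if you have it.

```python
# Assumption: ~300M parameters for the frozen ViT-L/16 backbone
# (approximate; not stated in this model card).
BACKBONE_PARAMS = 300_000_000
OVERHEAD = 0.037  # "3.7% additional parameters", from the description above


def added_parameters(backbone_params: int, overhead: float) -> int:
    """Parameters added on top of the frozen backbone, rounded to an integer."""
    return round(backbone_params * overhead)


added = added_parameters(BACKBONE_PARAMS, OVERHEAD)
total = BACKBONE_PARAMS + added
print(f"added: ~{added / 1e6:.1f}M parameters, total: ~{total / 1e6:.1f}M")
```

Under this assumption the added modules are on the order of ten million parameters, and the backbone itself stays frozen.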

	Compared to patch-based vision-language backbones, T-REN yields stronger dense cross-modal understanding while reducing token counts by more than 24x for images and 187x for videos.
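
As a rough illustration of what these ratios imply, the sketch below counts the patch tokens a /16 ViT produces and divides by the stated reduction factors. The 512x512 resolution and the 16-frame clip are hypothetical inputs chosen only for the arithmetic, and the helper names are illustrative. The actual number of region tokens presumably depends on image content, so treat the results as bounds implied by the ratios, not as model outputs.

```python
PATCH_SIZE = 16  # from the "/16" in ViT-L/16


def patch_tokens(height: int, width: int, patch_size: int = PATCH_SIZE) -> int:
    """Patch tokens for one image (ignores any CLS or register tokens)."""
    return (height // patch_size) * (width // patch_size)


def max_region_tokens(n_patch_tokens: int, reduction: int) -> int:
    """Largest token count consistent with at least `reduction`x fewer tokens."""
    return n_patch_tokens // reduction


# Hypothetical inputs: one 512x512 image and a 16-frame clip at the same size.
image_patches = patch_tokens(512, 512)
video_patches = 16 * image_patches

print("image patch tokens:", image_patches)
print("image region tokens (at most):", max_region_tokens(image_patches, 24))
print("video patch tokens:", video_patches)
print("video region tokens (at most):", max_region_tokens(video_patches, 187))
```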

	- Paper: [T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability](https://huggingface.co/papers/2604.18573)
	- GitHub Repository: [savya08/T-REN](https://github.com/savya08/T-REN)

	## Highlights

	Relative to patch-based vision-language backbones, T-REN delivers:
	- +5.9 mIoU on ADE20K open-vocabulary segmentation.
	- +18.4% recall on COCO object-level text-image retrieval.
	- +15.6% recall on Ego4D video object localization.
	- +17.6% mIoU on VSPW video scene parsing.
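
For readers less familiar with the segmentation metric in the list above, mIoU averages the per-class intersection-over-union between predicted and ground-truth labels. The sketch below shows the standard definition on flat label lists. It is an illustration only, not the evaluation code behind the reported numbers, and `mean_iou` is a hypothetical helper name.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four pixels, two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(f"mIoU: {mean_iou(pred, target, num_classes=2):.4f}")
```

Benchmarks usually report this value on a 0-100 scale, so a gain such as +5.9 mIoU is most naturally read as percentage points.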

	## Citation

	```bibtex
	@misc{khosla2026tren,
	  title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability},
	  author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
	  year={2026},
	  eprint={2604.18573},
	  archivePrefix={arXiv},
	  primaryClass={cs.CV},
	}
	```