savyak2 nielsr HF Staff committed on
Commit cfc2ea8 · 1 Parent(s): bf315ea

Add model card info and metadata (#1)

- Add model card info and metadata (6ea0fa84ec980971f0eab1dd7ff75e1148853e41)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +33 -3
README.md CHANGED
@@ -1,3 +1,33 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: image-feature-extraction
+ ---
+
+ # T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability
+
+ T-REN (**T**ext-aligned **R**egion **E**ncoder **N**etwork) is an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (region tokens). It is built on top of a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-L/16 backbone and adds only 3.7% additional parameters.
+
+ Compared to patch-based vision-language backbones, T-REN yields stronger dense cross-modal understanding while significantly reducing token counts by more than 24x for images and 187x for videos.
+
+ - **Paper:** [T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability](https://huggingface.co/papers/2604.18573)
+ - **GitHub Repository:** [savya08/T-REN](https://github.com/savya08/T-REN)
+
+ ## Highlights
+
+ Specifically, T-REN delivers:
+ - **+5.9 mIoU** on ADE20K open-vocabulary segmentation.
+ - **+18.4% recall** on COCO object-level text-image retrieval.
+ - **+15.6% recall** on Ego4D video object localization.
+ - **+17.6% mIoU** on VSPW video scene parsing.
+
+ ## Citation
+
+ ```bibtex
+ @misc{khosla2026tren,
+ title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability},
+ author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
+ year={2026},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV},
+ }
+ ```
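
The "more than 24x" image token reduction stated in the model card can be made concrete with a little arithmetic. The sketch below is only an illustration: the 512x512 input resolution is an assumption (the card does not state one), the function names are hypothetical, and the resulting region-token figure is an upper bound implied by the stated factor, not a number reported by the authors. It counts the patch tokens a ViT with 16x16 patches would produce and then the largest token count consistent with a reduction of at least 24x.

```python
def patch_token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Patch tokens a ViT produces for one image (ignoring CLS/register tokens)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


def max_tokens_after_reduction(patch_tokens: int, reduction_factor: int) -> int:
    """Largest token count consistent with a reduction of at least `reduction_factor`."""
    return patch_tokens // reduction_factor


# Assumed resolution for illustration only; not taken from the model card.
patches = patch_token_count(512, 512)
regions = max_tokens_after_reduction(patches, 24)
print(f"{patches} patch tokens -> at most {regions} region tokens at a 24x reduction")
```

Under these assumptions, a patch-based backbone would emit 1024 tokens for the image, while a 24x reduction would leave at most 42.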