Add model card info and metadata (#1)
Browse files- Add model card info and metadata (6ea0fa84ec980971f0eab1dd7ff75e1148853e41)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
---
license: mit
pipeline_tag: image-feature-extraction
---

# T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability

T-REN (**T**ext-aligned **R**egion **E**ncoder **N**etwork) is an efficient encoder that maps visual data to a compact set of text-aligned region-level representations (region tokens). It is built on top of a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-L/16 backbone and adds only 3.7% additional parameters.

Compared to patch-based vision-language backbones, T-REN yields stronger dense cross-modal understanding while reducing token counts by more than 24x for images and 187x for videos.

- **Paper:** [T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability](https://huggingface.co/papers/2604.18573)
- **GitHub Repository:** [savya08/T-REN](https://github.com/savya08/T-REN)

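To make the idea concrete, here is a minimal, self-contained sketch (not the released T-REN code) of the general pattern the description above implies: a small set of learned region queries cross-attends over frozen patch features and pools them into far fewer region tokens. All names, shapes, and the single-head attention pooling are illustrative assumptions.

```python
import numpy as np

def pool_region_tokens(patch_tokens, region_queries):
    """Pool N frozen patch tokens into K region tokens via softmax
    cross-attention (illustrative sketch, not the released T-REN code)."""
    d = patch_tokens.shape[-1]
    scores = region_queries @ patch_tokens.T / np.sqrt(d)     # (K, N) attention scores
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)     # softmax over patches
    return weights @ patch_tokens                             # (K, d) region tokens

rng = np.random.default_rng(0)
N, K, d = 196, 8, 64      # e.g. 14x14 patches pooled to 8 regions (~24x fewer tokens)
patches = rng.standard_normal((N, d))   # stand-in for frozen backbone features
queries = rng.standard_normal((K, d))   # stand-in for learned region queries
regions = pool_region_tokens(patches, queries)
print(regions.shape)      # (8, 64)
```

In this toy configuration, 196 patch tokens collapse to 8 region tokens, which matches the order of reduction (more than 24x for images) reported above.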
## Highlights

T-REN delivers:

- **+5.9 mIoU** on ADE20K open-vocabulary segmentation.
- **+18.4% recall** on COCO object-level text-image retrieval.
- **+15.6% recall** on Ego4D video object localization.
- **+17.6 mIoU** on VSPW video scene parsing.

## Citation

```bibtex
@misc{khosla2026tren,
      title={T-REN: Learning Text-Aligned Region Tokens Improves Dense Vision-Language Alignment and Scalability},
      author={Savya Khosla and Sethuraman T V and Aryan Chadha and Alexander Schwing and Derek Hoiem},
      year={2026},
      eprint={2604.18573},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}
```