---
license: apache-2.0
pipeline_tag: feature-extraction
---

# UniTok: A Unified Tokenizer for Visual Generation and Understanding

This repository contains UniTok, a unified visual tokenizer for both image generation and understanding tasks, as presented in [UniTok: A Unified Tokenizer for Visual Generation and Understanding](https://hf.co/papers/2502.20321).

Project Page: https://foundationvision.github.io/UniTok/ <br>
Code: https://github.com/FoundationVision/UniTok

<p align="center">
<img src="https://github.com/FoundationVision/UniTok/blob/main/assets/teaser.png?raw=true" width=93%>
</p>

UniTok encodes fine-grained details for generation and captures high-level semantics for understanding. It is compatible with autoregressive generative models (e.g., LlamaGen), multimodal understanding models (e.g., LLaVA), and unified MLLMs (e.g., Chameleon and Liquid).

Building on UniTok, we construct an MLLM capable of both multimodal generation and understanding, which sets a new state of the art among unified autoregressive MLLMs. The weights of this MLLM will be released soon.

<p align="center">
<img src="https://github.com/FoundationVision/UniTok/blob/main/assets/samples.png?raw=true" width=93%>
</p>

## Performance

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th>#Tokens</th>
      <th>rFID ↓</th>
      <th>Accuracy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td colspan="4"><i>VQVAE Model</i></td>
    </tr>
    <tr align="center">
      <td>VQ-GAN</td>
      <td>256</td>
      <td>4.98</td>
      <td>--</td>
    </tr>
    <tr align="center">
      <td>RQ-VAE</td>
      <td>256</td>
      <td>1.30</td>
      <td>--</td>
    </tr>
    <tr align="center">
      <td>VAR</td>
      <td>680</td>
      <td>0.90</td>
      <td>--</td>
    </tr>
    <tr>
      <td colspan="4"><i>CLIP Model</i></td>
    </tr>
    <tr align="center">
      <td>CLIP</td>
      <td>256</td>
      <td>--</td>
      <td>76.2</td>
    </tr>
    <tr align="center">
      <td>SigLIP</td>
      <td>256</td>
      <td>--</td>
      <td>80.5</td>
    </tr>
    <tr align="center">
      <td>ViTamin</td>
      <td>256</td>
      <td>--</td>
      <td>81.2</td>
    </tr>
    <tr>
      <td colspan="4"><i>Unified Model</i></td>
    </tr>
    <tr align="center">
      <td>TokenFlow †</td>
      <td>680</td>
      <td>1.37</td>
      <td>--</td>
    </tr>
    <tr align="center">
      <td>VILA-U †</td>
      <td>256</td>
      <td>1.80</td>
      <td>73.3</td>
    </tr>
    <tr align="center">
      <td>UniTok</td>
      <td>256</td>
      <td>0.39</td>
      <td>70.5</td>
    </tr>
    <tr align="center">
      <td>UniTok †</td>
      <td>256</td>
      <td>0.38</td>
      <td>78.6</td>
    </tr>
  </tbody>
</table>

This repository hosts the UniTok checkpoints.

For more details and tutorials, see https://github.com/FoundationVision/UniTok.

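
Since this repository only hosts checkpoints, a common first step is to fetch them locally. The snippet below is a minimal sketch rather than an official loading procedure. It assumes the `huggingface_hub` package is installed; the repo id `FoundationVision/UniTok`, the helper names `download_checkpoints` and `list_checkpoints`, and the list of checkpoint suffixes are illustrative assumptions that this card does not state.

```python
from pathlib import Path

# Assumption: hypothetical Hub repo id; replace it with the id of this repository.
REPO_ID = "FoundationVision/UniTok"

# Assumption: common PyTorch checkpoint suffixes; the actual file names are not
# listed in this card.
CHECKPOINT_SUFFIXES = {".pth", ".pt", ".bin", ".safetensors"}


def download_checkpoints(repo_id: str = REPO_ID, local_dir: str = "unitok_ckpt") -> Path:
    """Download every file of the Hub repo into local_dir and return that folder."""
    # Imported lazily so list_checkpoints works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=repo_id, local_dir=local_dir))


def list_checkpoints(directory) -> list[Path]:
    """Return checkpoint-like files found under directory, sorted by path."""
    root = Path(directory)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in CHECKPOINT_SUFFIXES
    )
```

Calling `list_checkpoints(download_checkpoints())` would download the repository and list the weight files it contains. Loading those weights into the tokenizer requires the model code from the GitHub repository linked above.
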
## Citation

```bibtex
@article{unitok,
  title={UniTok: A Unified Tokenizer for Visual Generation and Understanding},
  author={Ma, Chuofan and Jiang, Yi and Wu, Junfeng and Yang, Jihan and Yu, Xin and Yuan, Zehuan and Peng, Bingyue and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2502.20321},
  year={2025}
}
```