# PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation
<div align="center">
<a href="https://plan-lab.github.io/pyratok"><img src="https://img.shields.io/badge/Project-Website-blue?style=for-the-badge&logo=googlechrome"></a>
<a href="https://arxiv.org/abs/2601.16210"><img src="https://img.shields.io/badge/arXiv-2601.16210-b31b1b.svg?style=for-the-badge"></a>
<a href="https://github.com/PLAN-Lab/PyraTok"><img src="https://img.shields.io/badge/Code-GitHub-black?style=for-the-badge&logo=github"></a>
</div>
---
### πŸ“’ Official Announcement
**PyraTok** has been officially accepted to **CVPR 2026**! πŸŽ‰
This repository contains the pretrained weights and model implementation for the Language-aligned Pyramidal Tokenizer.
---
## πŸš€ Overview
**PyraTok** is a state-of-the-art video tokenizer that bridges the gap between video understanding and generation. Unlike traditional VAEs that operate at a single visual scale, PyraTok introduces a **Language-aligned Pyramidal Quantization (LaPQ)** module.
### Key Innovations:
* **Pyramidal Structure:** Learns semantically structured discrete latents across multiple spatiotemporal resolutions.
* **Language Alignment:** Tightly couples visual tokens with language using a shared, large binary codebook (up to 48K tokens).
* **Scalability:** Robustly scales from standard resolutions to **4K/8K video** processing.
* **Unified Backbone:** A single model that excels in Video QA, Zero-Shot Segmentation, and high-fidelity Text-to-Video generation.
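To make the pyramidal, binary-codebook idea concrete, here is a minimal NumPy sketch, not the official implementation: it binarizes each latent dimension by sign to produce one integer token per vector, and pools a toy feature map across scales to emit tokens at multiple resolutions. The function names, the 16-bit code width, and the mean-pooling pyramid are illustrative assumptions; PyraTok's actual LaPQ module may differ.

```python
import numpy as np

def binary_quantize(z):
    """Map each latent vector to one integer token via sign binarization
    (illustrative lookup-free quantization; the paper's LaPQ may differ)."""
    bits = (z > 0).astype(np.int64)              # per-dimension binary code
    return bits @ (2 ** np.arange(z.shape[-1]))  # bits -> integer token index

def pyramid_tokens(feat, levels=3):
    """Tokenize an (H, W, C) feature map at several spatial scales,
    average-pooling 2x between levels (a stand-in for a spatiotemporal
    pyramid; hypothetical, for illustration only)."""
    out = []
    for _ in range(levels):
        h, w, c = feat.shape
        out.append(binary_quantize(feat.reshape(-1, c)))  # tokens at this scale
        feat = feat.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))  # 2x pool
    return out

# A 16-bit binary code addresses 2**16 = 65536 entries, on the order of
# the up-to-48K codebook size reported above.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))  # toy feature map, 16-dim latents
levels = pyramid_tokens(feat)
print([t.shape[0] for t in levels])  # token counts per scale: [64, 16, 4]
```

Note how each pyramid level reuses the same quantizer, which mirrors the idea of one shared codebook serving every spatiotemporal resolution.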
## 📄 Citation
```bibtex
@inproceedings{susladkar2026pyratok,
  title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
  author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```