# PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

---

### 📢 Official Announcement

**PyraTok** has been officially accepted to **CVPR 2026**! 🎉

This repository contains the pretrained weights and model implementation for the Language-aligned Pyramidal Tokenizer.

---

## 🚀 Overview

**PyraTok** is a state-of-the-art video tokenizer that bridges the gap between video understanding and generation. Unlike traditional VAEs that operate at a single visual scale, PyraTok introduces a **Language-aligned Pyramidal Quantization (LaPQ)** module.

### Key Innovations:

* **Pyramidal Structure:** Learns semantically structured discrete latents across multiple spatiotemporal resolutions.
* **Language Alignment:** Tightly couples visual tokens with language using a shared, large binary codebook (up to 48K tokens).
* **Scalability:** Robustly scales from standard resolutions to **4K/8K video** processing.
* **Unified Backbone:** A single model that excels in Video QA, Zero-Shot Segmentation, and high-fidelity Text-to-Video generation.

```
@inproceedings{susladkar2026pyratok,
  title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
  author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```