Add model card for VideoMLA
Hi! I'm Niels from the community science team at Hugging Face.
I noticed that this repository didn't have a model card. I've opened this PR to add a README that includes:
- Metadata for the `text-to-video` pipeline tag.
- Links to the paper, project page, and GitHub repository.
- A brief introduction to VideoMLA.
- Sample usage for inference based on your GitHub README.
Feel free to merge this or let me know if you'd like any changes!
README.md
ADDED
@@ -0,0 +1,38 @@
---
pipeline_tag: text-to-video
---

# VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

VideoMLA is the first study of Multi-Head Latent Attention (MLA) in video diffusion. By replacing per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, it reduces per-token KV memory by 92.7% at every cached layer. This enables efficient, minute-scale autoregressive video generation with improved throughput.

[[Paper](https://huggingface.co/papers/2605.30351)] [[Project Page](https://videomla.github.io/)] [[GitHub](https://github.com/yesiltepe-hidir/VideoMLA)]
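
To make the per-token KV memory claim concrete, here is a minimal accounting sketch. It assumes the usual MLA bookkeeping, where standard multi-head attention caches one key and one value vector per head, while an MLA-style cache stores a single shared content latent plus a single shared positional key. All dimensions below are hypothetical and are not taken from the VideoMLA paper or code; they were picked only because they happen to land on the quoted 92.7% figure.

```python
def kv_elements_per_token_mha(num_heads: int, head_dim: int) -> int:
    """Standard multi-head attention: one key and one value vector per head."""
    return 2 * num_heads * head_dim


def kv_elements_per_token_mla(latent_dim: int, rope_dim: int) -> int:
    """MLA-style cache: one shared content latent plus one shared positional key."""
    return latent_dim + rope_dim


# HYPOTHETICAL dimensions for illustration only (not from the VideoMLA source).
NUM_HEADS, HEAD_DIM = 12, 128
LATENT_DIM, ROPE_DIM = 160, 64

mha = kv_elements_per_token_mha(NUM_HEADS, HEAD_DIM)
mla = kv_elements_per_token_mla(LATENT_DIM, ROPE_DIM)
reduction = 1 - mla / mha  # fraction of per-token, per-layer KV memory saved

print(f"MHA: {mha} elements, MLA: {mla} elements, reduction: {reduction:.1%}")
```

With these illustrative numbers the cache shrinks from 3072 to 224 elements per token per layer. The actual head count, latent rank, and positional-key width used by VideoMLA should be read from the paper or the released configs.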

## Inference

To use the model, please follow the setup instructions in the [official repository](https://github.com/yesiltepe-hidir/VideoMLA). You can generate videos using the provided inference script:

```bash
python inference.py \
  --config_path configs/stage3_long.yaml \
  --checkpoint_path checkpoints/stage3_la6_sink1/model.pt \
  --output_folder outputs/ \
  --data_path prompts/your_prompts.txt \
  --num_output_frames 120 \
  --use_ema
```

Key arguments:
- `--num_output_frames`: Controls the length of the video (e.g., 21 ≈ 5s, 120 ≈ 30s, 240 ≈ 60s at 16fps).
- `--data_path`: A text file containing prompts (one per line).
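
The durations quoted for `--num_output_frames` do not follow from dividing the value by 16 fps directly (21 / 16 is only about 1.3 s), so the value is presumably counted in temporally compressed (latent) frames. The sketch below reproduces the quoted durations under the assumption that each unit decodes to 4 video frames. That factor of 4 is an assumption that fits the three examples; it is not stated in the source, and the helper name is hypothetical.

```python
FPS = 16  # frame rate quoted in the README

# ASSUMPTION: each unit of --num_output_frames decodes to 4 video frames
# (4x temporal compression). This factor is not stated in the source.
FRAMES_PER_UNIT = 4


def approx_duration_seconds(
    num_output_frames: int,
    fps: int = FPS,
    frames_per_unit: int = FRAMES_PER_UNIT,
) -> float:
    """Rough video length in seconds for a given --num_output_frames value."""
    return num_output_frames * frames_per_unit / fps


for n in (21, 120, 240):
    print(f"--num_output_frames {n} -> ~{approx_duration_seconds(n):.2f} s")
```

Under this assumption the three example values map to roughly 5.25 s, 30 s, and 60 s, which matches the list above. Check the official repository for the exact relationship before relying on it.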

## Citation

```bibtex
@article{yesiltepe2026videomla,
  title={VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion},
  author={Yesiltepe, Hidir and Hu, Jiazhen and Meral, Tuna Han Salih and Akan, Adil Kaan and Oktay, Kaan and Eldardiry, Hoda and Yanardag, Pinar},
  journal={arXiv preprint arXiv:2605.30351},
  year={2026}
}
```