	---
	license: apache-2.0
	datasets:
	- FastVideo/Wan-Syn_77x448x832_600k
	base_model:
	- Wan-AI/Wan2.1-T2V-14B-Diffusers
	---
	# FastVideo FastWan2.1-T2V-14B-480P-Diffusers
	<p align="center">
	<img src="https://raw.githubusercontent.com/hao-ai-lab/FastVideo/main/assets/logo.jpg" width="200"/>
	</p>
	<div>
	<div align="center">
	<a href="https://github.com/hao-ai-lab/FastVideo" target="_blank">FastVideo Team</a>&emsp;
	</div>

	<div align="center">
	<a href="https://arxiv.org/pdf/2505.13389">Paper</a> \|
	<a href="https://github.com/hao-ai-lab/FastVideo">GitHub</a>
	</div>
	</div>



	## Introduction

	This model is jointly finetuned with [DMD](https://arxiv.org/pdf/2405.14867) and [VSA](https://arxiv.org/pdf/2505.13389), based on [Wan-AI/Wan2.1-T2V-14B-Diffusers](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B-Diffusers). It supports efficient 3-step inference and generates high-quality videos at 61×448×832 resolution (frames × height × width). It was trained on the [FastVideo 480P Synthetic Wan dataset](https://huggingface.co/datasets/FastVideo/Wan-Syn_77x448x832_600k), which consists of 600k synthetic latents.

	---

	## Model Overview

	- Supports 3-step inference, achieving up to a 50x speedup on a single H100 GPU.
	- Generates videos at 61×448×832 resolution.
	- Finetuning and inference scripts are available in the [FastVideo](https://github.com/hao-ai-lab/FastVideo) repository:
	  - [Finetuning script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/distill/v1_distill_dmd_wan_VSA.sh)
	  - [Inference script](https://github.com/hao-ai-lab/FastVideo/blob/main/scripts/inference/v1_inference_wan_dmd.sh)
	- Try it out on FastVideo: we support a wide range of GPUs, from the H100 to the 4090, as well as Mac users.
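To give a feel for what the 61×448×832 output size means for the transformer, here is a minimal sketch that estimates the latent shape and attention sequence length. The VAE strides (4× temporal, 8× spatial) and the 1×2×2 patch size are assumptions taken from the public Wan2.1 design, not from this card, and the helper names are hypothetical.

```python
# Assumptions (not stated in this model card): Wan2.1-style causal video VAE
# with 4x temporal / 8x spatial compression, and a DiT patch size of 1x2x2.
VAE_STRIDE = (4, 8, 8)   # (frames, height, width)
PATCH_SIZE = (1, 2, 2)   # (frames, height, width)


def latent_shape(frames, height, width, stride=VAE_STRIDE):
    """Return the (frames, height, width) of the latent video."""
    st, sh, sw = stride
    # Causal VAE: the first frame maps to one latent frame on its own,
    # then every further group of `st` frames adds one more.
    return ((frames - 1) // st + 1, height // sh, width // sw)


def token_count(latent, patch=PATCH_SIZE):
    """Return the number of transformer tokens for a latent of this shape."""
    t, h, w = latent
    pt, ph, pw = patch
    return (t // pt) * (h // ph) * (w // pw)


if __name__ == "__main__":
    for frames in (61, 77):  # inference length and dataset clip length
        latent = latent_shape(frames, 448, 832)
        print(frames, latent, token_count(latent))
```

Under these assumptions a 61-frame clip corresponds to roughly 23k tokens per attention call, which is the regime where sparse attention and few-step sampling matter most.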

	### Training Infrastructure

	Training was conducted on 8 nodes with 64 H200 GPUs in total, using a `global batch size = 64`.
	We enabled `gradient checkpointing`, set `HSDP_shard_dim = 8` and `sequence_parallel_size = 4`, and used `learning rate = 1e-5`.
	We set the VSA attention sparsity to 0.9, and training ran for 3000 steps (~52 hours).
	A detailed example training script is available [here](https://github.com/hao-ai-lab/FastVideo/blob/main/examples/distill/Wan-Syn-480P/distill_dmd_VSA_t2v_14B_480P.slurm).
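The figures above imply a rough training budget, which the following sketch works out. It uses only the numbers quoted in this section and the 600k dataset size. The constant and key names are hypothetical, and reading the parallelism settings as "GPUs divided by group size" is an assumption about how the flags map onto the cluster, not a description of FastVideo internals.

```python
# Numbers quoted in this model card.
NUM_GPUS = 64
GLOBAL_BATCH = 64
SEQUENCE_PARALLEL_SIZE = 4
HSDP_SHARD_DIM = 8
STEPS = 3000
WALL_HOURS = 52          # approximate, per the card
DATASET_SIZE = 600_000   # synthetic latents


def training_budget():
    """Derive back-of-the-envelope figures from the quoted settings."""
    samples_seen = STEPS * GLOBAL_BATCH
    return {
        "seconds_per_step": WALL_HOURS * 3600 / STEPS,
        "gpu_hours": NUM_GPUS * WALL_HOURS,
        "samples_seen": samples_seen,
        "dataset_fraction": samples_seen / DATASET_SIZE,
        # Assumption: each sequence-parallel group of 4 GPUs shares one sample.
        "sequence_parallel_groups": NUM_GPUS // SEQUENCE_PARALLEL_SIZE,
        # Assumption: weights are sharded across 8 GPUs and replicated across the rest.
        "hsdp_replicas": NUM_GPUS // HSDP_SHARD_DIM,
    }


if __name__ == "__main__":
    for name, value in training_budget().items():
        print(f"{name}: {value}")
```

Read this way, the run takes about a minute per step, uses on the order of 3,300 H200 GPU-hours, and sees about a third of the dataset once.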



	If you use the FastWan2.1-T2V-14B-480P-Diffusers model in your research, please cite our papers:
	```
	@article{zhang2025vsa,
	title={VSA: Faster Video Diffusion with Trainable Sparse Attention},
	author={Zhang, Peiyuan and Huang, Haofeng and Chen, Yongqi and Lin, Will and Liu, Zhengzhong and Stoica, Ion and Xing, Eric and Zhang, Hao},
	journal={arXiv preprint arXiv:2505.13389},
	year={2025}
	}
	@article{zhang2025fast,
	title={Fast video generation with sliding tile attention},
	author={Zhang, Peiyuan and Chen, Yongqi and Su, Runlong and Ding, Hangliang and Stoica, Ion and Liu, Zhengzhong and Zhang, Hao},
	journal={arXiv preprint arXiv:2502.04507},
	year={2025}
	}
	```