---
license: apache-2.0
library_name: fastvideo
pipeline_tag: text-to-video
tags:
- text-to-video
- image-to-video
- text-to-audio
- super-resolution
- magi-human
- davinci
- sii-gair
- sand-ai
- audio-visual-generation
- multimodal
base_model:
- GAIR/daVinci-MagiHuman
---
# daVinci-MagiHuman — FastVideo Diffusers port
A FastVideo-format port of [SII-GAIR + Sand.ai's daVinci-MagiHuman](https://github.com/GAIR-NLP/daVinci-MagiHuman) joint
audio-visual generative model: one repo with four sibling subfolders, one umbrella
HF model string per variant, and all four variants bit-exact against the official reference.
> 15B-parameter single-stream transformer that jointly denoises video + audio in
> a unified token sequence. Generates a 5-second 256p clip with synchronized
> audio in ~2 s on a single H100. See the
> [paper](https://arxiv.org/abs/2603.21986) and the
> [official repo](https://github.com/GAIR-NLP/daVinci-MagiHuman).
## Variant matrix
| Subfolder | Model | Inference modes | Steps | CFG | Output | DiT files |
|---|---|---|---|---|---|---|
| `base/` | base 15B | T2V, TI2V | 32 | CFG=2 | 480x256 mp4 (video + audio) | 7 |
| `distill/` | DMD-2 distilled 15B | T2V, TI2V | 8 | no CFG | 480x256 mp4 (video + audio) | 7 |
| `sr_540p/` | base + SR 540p | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~896x512 mp4 (video + audio) | 20 |
| `sr_1080p/` | base + SR 1080p (block-sparse local-window attention on 32/40 SR DiT layers) | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~1920x1088 mp4 (video + audio) | 15 |
`T2V` = text only. `TI2V` = text + reference image: the image is encoded
through the Wan VAE and stitched into the first video latent frame at every
denoise step, matching the upstream `evaluate_with_latent` per-step overwrite.
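In sketch form, the TI2V conditioning reduces to re-stamping the reference
latent after every scheduler step (hypothetical shapes and a stubbed denoise
loop; only the overwrite line reflects the actual contract):
```python
import torch

# Illustrative shapes only; the real logic lives in MagiHumanI2VPipeline.
latents = torch.randn(1, 48, 5, 32, 56)      # [B, C, T_latent, H', W']
ref_latent = torch.randn(1, 48, 1, 32, 56)   # Wan-VAE-encoded reference image

for step in range(32):                       # 32 denoise steps for base/
    # ... DiT forward + scheduler update of `latents` would happen here ...
    latents[:, :, :1] = ref_latent           # overwrite first latent frame each step
```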
All four DiTs share the same architecture (40 layers, hidden=5120, head_dim=128,
GQA num_query_groups=8); only the weights differ. SR-1080p additionally
restricts video→video attention to a local window of `frame_receptive_field=11`
on 32 of 40 SR DiT layers (matches upstream's `SR2_1080` config override).
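Derived from those numbers, the grouped-query layout works out as below (a
quick sanity check; head counts are computed from the stated dims, not read
from the checkpoint config):
```python
hidden, head_dim, num_query_groups = 5120, 128, 8

num_heads = hidden // head_dim                   # 40 query heads per layer
heads_per_group = num_heads // num_query_groups  # 5 query heads share each KV head
assert (num_heads, heads_per_group) == (40, 5)
```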
## Quick start
Install [FastVideo](https://github.com/hao-ai-lab/FastVideo); commit `c05c1048`
or later on the `will/magi` branch contains all four variants:
```bash
uv pip install fastvideo
# or pin: uv pip install 'fastvideo @ git+https://github.com/hao-ai-lab/FastVideo@will/magi'
```
Accept terms on the two gated upstream repos that the pipeline lazy-loads from:
- [`google/t5gemma-9b-9b-ul2`](https://huggingface.co/google/t5gemma-9b-9b-ul2) — text encoder + tokenizer
- [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) — audio VAE
Then export your `HF_TOKEN`; any token with read access to the two gated repos works:
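```bash
export HF_TOKEN=hf_your_token_here   # placeholder token; `huggingface-cli login` also works
```
Now run any of the snippets below.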
### Base T2V (~5 s on H100)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
)
generator.generate_video(
    prompt="A warm afternoon scene: a person sits on a park bench reading a book, "
           "surrounded by softly swaying trees.",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```
### Distill T2V (~2 s on H100, no CFG)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/distill", num_gpus=1)
generator.generate_video(prompt="...", output_path="output.mp4", save_video=True)
generator.shutdown()
```
### Base TI2V (text + reference image)
```python
from fastvideo import VideoGenerator
from fastvideo.pipelines.basic.magi_human.pipeline_configs import MagiHumanBaseI2VConfig
generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
    workload_type="i2v",
    override_pipeline_cls_name="MagiHumanI2VPipeline",
    pipeline_config=MagiHumanBaseI2VConfig(),
)
generator.generate_video(
    prompt="A cheerful saxophonist performs a short line in a small jazz club.",
    image_path="reference.jpg",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```
### Super-resolution (540p)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_540p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_540p.mp4", save_video=True)
generator.shutdown()
```
### Super-resolution (1080p)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_1080p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_1080p.mp4", save_video=True)
generator.shutdown()
```
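Per-call sampling overrides can be passed as keyword arguments to
`generate_video`; the field names below follow FastVideo's `SamplingParam` and
are an assumption about your installed version, so treat this as a sketch:
```python
generator.generate_video(
    prompt="...",
    seed=42,                   # assumption: SamplingParam exposes a seed field
    num_inference_steps=32,    # 8 for distill/, per the variant matrix above
    output_path="output.mp4",
    save_video=True,
)
```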
For full runnable examples of all eight (variant × mode) combinations, see
[`examples/inference/basic/basic_magi_human*.py`](https://github.com/hao-ai-lab/FastVideo/tree/will/magi/examples/inference/basic).
## Lazy-load contract — what FastVideo fetches
Each subfolder of this repo only ships variant-specific weights:
```
<subfolder>/
├── model_index.json
├── transformer/ ← variant DiT weights
├── scheduler/ ← FlowUniPCMultistepScheduler config
└── sr_transformer/ ← only in sr_540p/, sr_1080p/
```
The cross-variant shared components (~25 GB total: video VAE, text encoder and
tokenizer, and audio VAE) are **lazy-loaded from their canonical upstream HF
repos** the first time the pipeline runs:
| Component | Source | Gated? |
|---|---|---|
| Wan 2.2 TI2V-5B VAE (video decode) | [`Wan-AI/Wan2.2-TI2V-5B-Diffusers`](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers) | no |
| T5-Gemma 9B encoder + tokenizer | [`google/t5gemma-9b-9b-ul2`](https://huggingface.co/google/t5gemma-9b-9b-ul2) | yes (Google terms of use) |
| Stable Audio Open 1.0 VAE (audio decode) | [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) | yes (Stability AI terms of use) |
Net effect: a user running all four variants downloads ~50 GB of variant
weights + a single ~25 GB shared cache, totaling ~75 GB instead of ~400 GB if
each variant bundled its own copies.
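To warm the shared cache ahead of time (useful on a machine that will later run
offline), the three upstream repos can be fetched explicitly; a minimal sketch
using `huggingface_hub`:
```python
from huggingface_hub import snapshot_download

# Pre-fetch the shared upstream components into the local HF cache.
# The two gated repos require accepted terms + HF_TOKEN in the environment.
for repo in (
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers",
    "google/t5gemma-9b-9b-ul2",
    "stabilityai/stable-audio-open-1.0",
):
    snapshot_download(repo_id=repo)
```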
## Re-converting from raw upstream weights
If you need to re-convert from the raw [GAIR/daVinci-MagiHuman](https://huggingface.co/GAIR/daVinci-MagiHuman):
```bash
# Each variant is converted individually; the umbrella layout is the
# concatenation of these four outputs.
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder base \
--output local_weights/base
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder distill \
--output local_weights/distill \
--cast-bf16
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder base \
--sr-source GAIR/daVinci-MagiHuman --sr-subfolder 540p_sr \
--output local_weights/sr_540p \
--cast-bf16
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder base \
--sr-source GAIR/daVinci-MagiHuman --sr-subfolder 1080p_sr \
--output local_weights/sr_1080p \
--cast-bf16
```
Pass `--bundle-vae` / `--bundle-audio-vae` / `--bundle-text-encoder` if you
want a self-contained snapshot instead of relying on lazy-load.
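To spot-check a converted output against a known-good copy, a quick
tensor-by-tensor diff works (paths and shard names below are hypothetical;
adapt them to your layout):
```python
from safetensors.torch import load_file

# Hypothetical paths: one converted shard vs. a reference conversion.
a = load_file("local_weights/base/transformer/model-00001-of-00007.safetensors")
b = load_file("reference/base/transformer/model-00001-of-00007.safetensors")

assert a.keys() == b.keys()
for name in a:
    diff_max = (a[name].float() - b[name].float()).abs().max().item()
    assert diff_max == 0.0, f"{name}: diff_max={diff_max}"
```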
## Parity vs official daVinci-MagiHuman
All four variants pass FastVideo's local parity battery **bit-exact**
(`diff_max=0.0, diff_mean=0.0`) against the official reference DiT:
| Test | Result |
|---|---|
| `test_magi_human_dit_parity` (base) | bit-exact |
| `test_magi_human_distill_dit_parity` | bit-exact |
| `test_magi_human_pipeline_latent_parity` (base T2V) | bit-exact |
| `test_magi_human_ti2v_pipeline_latent_parity` | bit-exact |
| `test_magi_human_sr540p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_sr1080p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_t5gemma_parity` | bit-exact |
| `test_magi_human_sa_audio_parity` (FV + official) | bit-exact |
| `test_magi_human_vae_parity` (Wan VAE decode) | 8e-4 max (fp32 op-order drift, tracked) |
Block-sparse local-window attention for SR-1080p is implemented as a
3-block accumulator over vanilla SDPA (per-frame video→local-video,
all-video→audio+text, and audio+text→all), which mathematically matches
upstream's
[`magi_attention.api.flex_flash_attn_func`](https://github.com/SandAI-org/MagiAttention/blob/main/magi_attention/functional/flex_flash_attn.py)
contract for this 3-block layout and is verified bit-exact.
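For illustration only, the dense boolean mask this 3-block layout induces can
be built directly and fed through vanilla SDPA. A sketch under two assumptions
flagged in the docstring (token ordering, and the symmetric-window reading of
`frame_receptive_field`); the real kernel accumulates the three blocks and
never materializes a dense mask:
```python
import torch
import torch.nn.functional as F

def sr1080_block_mask(frame_idx: torch.Tensor, n_extra: int, rf: int = 11) -> torch.Tensor:
    """Dense boolean mask equivalent to the 3-block layout above.

    Assumptions (not verified against upstream): video tokens come first,
    audio+text tokens last, and frame_receptive_field=11 means a symmetric
    window 11 frames wide, i.e. |frame(q) - frame(k)| <= rf // 2.
    """
    n_vid = frame_idx.numel()
    n = n_vid + n_extra
    mask = torch.ones(n, n, dtype=torch.bool)   # audio+text->all, video->audio+text: allowed
    dist = (frame_idx[:, None] - frame_idx[None, :]).abs()
    mask[:n_vid, :n_vid] = dist <= rf // 2      # video->video: local frame window only
    return mask

# Toy shapes: 4 frames x 3 video tokens per frame, plus 5 audio+text tokens.
frame_idx = torch.arange(4).repeat_interleave(3)
mask = sr1080_block_mask(frame_idx, n_extra=5, rf=3)
q = k = v = torch.randn(1, 8, 17, 64)           # [batch, heads, seq, head_dim]
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```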
## Citation
```bibtex
@article{davinci-magihuman-2026,
  title   = {Speed by Simplicity: A Single-Stream Architecture for Fast
             Audio-Video Generative Foundation Model},
  author  = {SII-GAIR and Sand.ai},
  journal = {arXiv preprint arXiv:2603.21986},
  year    = {2026}
}
@misc{fastvideo-magihuman-port,
  title        = {{daVinci-MagiHuman} for {FastVideo}},
  author       = {{FastVideo team}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/FastVideo/MagiHuman-Diffusers}}
}
```
## License
Apache 2.0 (matches upstream
[GAIR/daVinci-MagiHuman](https://huggingface.co/GAIR/daVinci-MagiHuman)).
## Acknowledgments
- [SII-GAIR](https://plms.ai) and [Sand.ai](https://sand.ai) for the original
daVinci-MagiHuman model and inference code.
- [Wan-AI](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B) for the Wan 2.2 video
VAE.
- [Google](https://huggingface.co/google/t5gemma-9b-9b-ul2) for the T5-Gemma
text encoder.
- [Stability AI](https://huggingface.co/stabilityai/stable-audio-open-1.0) for
the Stable Audio Open 1.0 audio VAE.
- [SandAI-org / MagiAttention](https://github.com/SandAI-org/MagiAttention) for
the canonical FFA / `flex_flash_attn_func` reference.