---
license: apache-2.0
library_name: fastvideo
pipeline_tag: text-to-video
tags:
- text-to-video
- image-to-video
- text-to-audio
- super-resolution
- magi-human
- davinci
- sii-gair
- sand-ai
- audio-visual-generation
- multimodal
base_model:
- GAIR/daVinci-MagiHuman
---
# daVinci-MagiHuman — FastVideo Diffusers port
A FastVideo-format port of [SII-GAIR + Sand.ai's daVinci-MagiHuman](https://github.com/GAIR-NLP/daVinci-MagiHuman) joint
audio-visual generative model: one repo with four sibling subfolders, one umbrella
HF model string per variant, and all four variants bit-exact against the official reference.
> 15B-parameter single-stream transformer that jointly denoises video + audio in
> a unified token sequence. Generates a 5-second 256p clip with synchronized
> audio in ~2 s on a single H100. See the
> [paper](https://arxiv.org/abs/2603.21986) and the
> [official repo](https://github.com/GAIR-NLP/daVinci-MagiHuman).
## Variant matrix
| Subfolder | Model | Inference modes | Steps | CFG | Output | DiT files |
|---|---|---|---|---|---|---|
| `base/` | base 15B | T2V, TI2V | 32 | CFG=2 | 480x256 mp4 (video + audio) | 7 |
| `distill/` | DMD-2 distilled 15B | T2V, TI2V | 8 | no CFG | 480x256 mp4 (video + audio) | 7 |
| `sr_540p/` | base + SR 540p | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~896x512 mp4 (video + audio) | 20 |
| `sr_1080p/` | base + SR 1080p (block-sparse local-window attention on 32/40 SR DiT layers) | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~1920x1088 mp4 (video + audio) | 15 |
`T2V` = text only. `TI2V` = text + reference image: the image is encoded
through the Wan VAE and stitched into the first video latent frame at every
denoise step, matching the upstream `evaluate_with_latent` per-step overwrite.
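In sketch form, the TI2V conditioning reduces to re-stamping the reference
latent after every scheduler step (hypothetical shapes and a stubbed denoise
loop; only the overwrite line reflects the actual contract):
```python
import torch

# Illustrative shapes only; the real logic lives in MagiHumanI2VPipeline.
latents = torch.randn(1, 48, 5, 32, 56)      # [B, C, T_latent, H', W']
ref_latent = torch.randn(1, 48, 1, 32, 56)   # Wan-VAE-encoded reference image

for step in range(32):                       # 32 denoise steps for base/
    # ... DiT forward + scheduler update of `latents` would happen here ...
    latents[:, :, :1] = ref_latent           # overwrite first latent frame each step
```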
All four DiTs share the same architecture (40 layers, hidden=5120, head_dim=128,
GQA num_query_groups=8); only the weights differ. SR-1080p additionally
restricts video→video attention to a local window of `frame_receptive_field=11`
on 32 of 40 SR DiT layers (matches upstream's `SR2_1080` config override).
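Derived from those numbers, the grouped-query layout works out as below (a
quick sanity check; head counts are computed from the stated dims, not read
from the checkpoint config):
```python
hidden, head_dim, num_query_groups = 5120, 128, 8

num_heads = hidden // head_dim                   # 40 query heads per layer
heads_per_group = num_heads // num_query_groups  # 5 query heads share each KV head
assert (num_heads, heads_per_group) == (40, 5)
```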
## Quick start
Install [FastVideo](https://github.com/hao-ai-lab/FastVideo); commit `c05c1048`
or later on the `will/magi` branch contains all four variants:
```bash
uv pip install fastvideo
# or pin: uv pip install 'fastvideo @ git+https://github.com/hao-ai-lab/FastVideo@will/magi'
```
Accept terms on the two gated upstream repos that the pipeline lazy-loads from:
- [`google/t5gemma-9b-9b-ul2`](https://huggingface.co/google/t5gemma-9b-9b-ul2) — text encoder + tokenizer
- [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) — audio VAE
Then export your `HF_TOKEN`; any token with read access to the two gated repos works:
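```bash
export HF_TOKEN=hf_your_token_here   # placeholder token; `huggingface-cli login` also works
```
Now run any of the snippets below.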
### Base T2V (~5 s on H100)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
)
generator.generate_video(
    prompt="A warm afternoon scene: a person sits on a park bench reading a book, "
           "surrounded by softly swaying trees.",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```
### Distill T2V (~2 s on H100, no CFG)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/distill", num_gpus=1)
generator.generate_video(prompt="...", output_path="output.mp4", save_video=True)
generator.shutdown()
```
### Base TI2V (text + reference image)
```python
from fastvideo import VideoGenerator
from fastvideo.pipelines.basic.magi_human.pipeline_configs import MagiHumanBaseI2VConfig
generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
    workload_type="i2v",
    override_pipeline_cls_name="MagiHumanI2VPipeline",
    pipeline_config=MagiHumanBaseI2VConfig(),
)
generator.generate_video(
    prompt="A cheerful saxophonist performs a short line in a small jazz club.",
    image_path="reference.jpg",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```
### Super-resolution (540p)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_540p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_540p.mp4", save_video=True)
generator.shutdown()
```
### Super-resolution (1080p)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_1080p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_1080p.mp4", save_video=True)
generator.shutdown()
```
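Per-call sampling overrides can be passed as keyword arguments to
`generate_video`; the field names below follow FastVideo's `SamplingParam` and
are an assumption about your installed version, so treat this as a sketch:
```python
generator.generate_video(
    prompt="...",
    seed=42,                   # assumption: SamplingParam exposes a seed field
    num_inference_steps=32,    # 8 for distill/, per the variant matrix above
    output_path="output.mp4",
    save_video=True,
)
```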
For full runnable examples of all eight (variant × mode) combinations, see
[`examples/inference/basic/basic_magi_human*.py`](https://github.com/hao-ai-lab/FastVideo/tree/will/magi/examples/inference/basic).
## Lazy-load contract — what FastVideo fetches
Each subfolder of this repo only ships variant-specific weights:
```
<subfolder>/
├── model_index.json
├── transformer/ ← variant DiT weights
├── scheduler/ ← FlowUniPCMultistepScheduler config
└── sr_transformer/ ← only in sr_540p/, sr_1080p/
```
The cross-variant shared components (~25 GB total: video VAE, text encoder and
tokenizer, and audio VAE) are **lazy-loaded from their canonical upstream HF
repos** the first time the pipeline runs:
| Component | Source | Gated? |
|---|---|---|
| Wan 2.2 TI2V-5B VAE (video decode) | [`Wan-AI/Wan2.2-TI2V-5B-Diffusers`](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers) | no |
| T5-Gemma 9B encoder + tokenizer | [`google/t5gemma-9b-9b-ul2`](https://huggingface.co/google/t5gemma-9b-9b-ul2) | yes (Google terms of use) |
| Stable Audio Open 1.0 VAE (audio decode) | [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) | yes (Stability AI terms of use) |
Net effect: a user running all four variants downloads ~50 GB of variant
weights + a single ~25 GB shared cache, totaling ~75 GB instead of ~400 GB if
each variant bundled its own copies.
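To warm the shared cache ahead of time (useful on a machine that will later run
offline), the three upstream repos can be fetched explicitly; a minimal sketch
using `huggingface_hub`:
```python
from huggingface_hub import snapshot_download

# Pre-fetch the shared upstream components into the local HF cache.
# The two gated repos require accepted terms + HF_TOKEN in the environment.
for repo in (
    "Wan-AI/Wan2.2-TI2V-5B-Diffusers",
    "google/t5gemma-9b-9b-ul2",
    "stabilityai/stable-audio-open-1.0",
):
    snapshot_download(repo_id=repo)
```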
## Re-converting from raw upstream weights
If you need to re-convert from the raw [GAIR/daVinci-MagiHuman](https://huggingface.co/GAIR/daVinci-MagiHuman):
```bash
# Each variant is converted individually; the umbrella layout is the
# concatenation of these four outputs.
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder base \
--output local_weights/base
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder distill \
--output local_weights/distill \
--cast-bf16
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder base \
--sr-source GAIR/daVinci-MagiHuman --sr-subfolder 540p_sr \
--output local_weights/sr_540p \
--cast-bf16
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder base \
--sr-source GAIR/daVinci-MagiHuman --sr-subfolder 1080p_sr \
--output local_weights/sr_1080p \
--cast-bf16
```
Pass `--bundle-vae` / `--bundle-audio-vae` / `--bundle-text-encoder` if you
want a self-contained snapshot instead of relying on lazy-load.
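To spot-check a converted output against a known-good copy, a quick
tensor-by-tensor diff works (paths and shard names below are hypothetical;
adapt them to your layout):
```python
from safetensors.torch import load_file

# Hypothetical paths: one converted shard vs. a reference conversion.
a = load_file("local_weights/base/transformer/model-00001-of-00007.safetensors")
b = load_file("reference/base/transformer/model-00001-of-00007.safetensors")

assert a.keys() == b.keys()
for name in a:
    diff_max = (a[name].float() - b[name].float()).abs().max().item()
    assert diff_max == 0.0, f"{name}: diff_max={diff_max}"
```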
## Parity vs official daVinci-MagiHuman
All four variants pass FastVideo's local parity battery **bit-exact**
(`diff_max=0.0, diff_mean=0.0`) against the official reference DiT:
| Test | Result |
|---|---|
| `test_magi_human_dit_parity` (base) | bit-exact |
| `test_magi_human_distill_dit_parity` | bit-exact |
| `test_magi_human_pipeline_latent_parity` (base T2V) | bit-exact |
| `test_magi_human_ti2v_pipeline_latent_parity` | bit-exact |
| `test_magi_human_sr540p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_sr1080p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_t5gemma_parity` | bit-exact |
| `test_magi_human_sa_audio_parity` (FV + official) | bit-exact |
| `test_magi_human_vae_parity` (Wan VAE decode) | 8e-4 max (fp32 op-order drift, tracked) |
Block-sparse local-window attention for SR-1080p is implemented as a
3-block accumulator over vanilla SDPA (per-frame video→local-video,
all-video→audio+text, and audio+text→all), which mathematically matches
upstream's
[`magi_attention.api.flex_flash_attn_func`](https://github.com/SandAI-org/MagiAttention/blob/main/magi_attention/functional/flex_flash_attn.py)
contract for this 3-block layout and is verified bit-exact.
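For illustration only, the dense boolean mask this 3-block layout induces can
be built directly and fed through vanilla SDPA. A sketch under two assumptions
flagged in the docstring (token ordering, and the symmetric-window reading of
`frame_receptive_field`); the real kernel accumulates the three blocks and
never materializes a dense mask:
```python
import torch
import torch.nn.functional as F

def sr1080_block_mask(frame_idx: torch.Tensor, n_extra: int, rf: int = 11) -> torch.Tensor:
    """Dense boolean mask equivalent to the 3-block layout above.

    Assumptions (not verified against upstream): video tokens come first,
    audio+text tokens last, and frame_receptive_field=11 means a symmetric
    window 11 frames wide, i.e. |frame(q) - frame(k)| <= rf // 2.
    """
    n_vid = frame_idx.numel()
    n = n_vid + n_extra
    mask = torch.ones(n, n, dtype=torch.bool)   # audio+text->all, video->audio+text: allowed
    dist = (frame_idx[:, None] - frame_idx[None, :]).abs()
    mask[:n_vid, :n_vid] = dist <= rf // 2      # video->video: local frame window only
    return mask

# Toy shapes: 4 frames x 3 video tokens per frame, plus 5 audio+text tokens.
frame_idx = torch.arange(4).repeat_interleave(3)
mask = sr1080_block_mask(frame_idx, n_extra=5, rf=3)
q = k = v = torch.randn(1, 8, 17, 64)           # [batch, heads, seq, head_dim]
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```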
## Citation
```bibtex
@article{davinci-magihuman-2026,
  title   = {Speed by Simplicity: A Single-Stream Architecture for Fast
             Audio-Video Generative Foundation Model},
  author  = {SII-GAIR and Sand.ai},
  journal = {arXiv preprint arXiv:2603.21986},
  year    = {2026}
}
@misc{fastvideo-magihuman-port,
  title        = {{daVinci-MagiHuman} for {FastVideo}},
  author       = {{FastVideo team}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/FastVideo/MagiHuman-Diffusers}}
}
```
## License
Apache 2.0 (matches upstream
[GAIR/daVinci-MagiHuman](https://huggingface.co/GAIR/daVinci-MagiHuman)).
## Acknowledgments
- [SII-GAIR](https://plms.ai) and [Sand.ai](https://sand.ai) for the original
daVinci-MagiHuman model and inference code.
- [Wan-AI](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B) for the Wan 2.2 video
VAE.
- [Google](https://huggingface.co/google/t5gemma-9b-9b-ul2) for the T5-Gemma
text encoder.
- [Stability AI](https://huggingface.co/stabilityai/stable-audio-open-1.0) for
the Stable Audio Open 1.0 audio VAE.
- [SandAI-org / MagiAttention](https://github.com/SandAI-org/MagiAttention) for
the canonical FFA / `flex_flash_attn_func` reference.