---
license: apache-2.0
library_name: fastvideo
pipeline_tag: text-to-video
tags:
- text-to-video
- image-to-video
- text-to-audio
- super-resolution
- magi-human
- davinci
- sii-gair
- sand-ai
- audio-visual-generation
- multimodal
base_model:
- GAIR/daVinci-MagiHuman
---

# daVinci-MagiHuman — FastVideo Diffusers port

A FastVideo-format port of [SII-GAIR + Sand.ai's daVinci-MagiHuman](https://github.com/GAIR-NLP/daVinci-MagiHuman) joint
audio-visual generative model. One repo, four sibling subfolders, one umbrella
HF string per variant, and all four variants bit-exact against the official reference.

> 15B-parameter single-stream transformer that jointly denoises video + audio in
> a unified token sequence. Generates a 5-second 256p clip with synchronized
> audio in ~2 s (distilled variant) on a single H100. See the
> [paper](https://arxiv.org/abs/2603.21986) and the
> [official repo](https://github.com/GAIR-NLP/daVinci-MagiHuman).

## Variant matrix

| Subfolder | Model | Inference modes | Steps | CFG | Output | DiT files |
|---|---|---|---|---|---|---|
| `base/` | base 15B | T2V, TI2V | 32 | CFG=2 | 480x256 mp4 (video + audio) | 7 |
| `distill/` | DMD-2 distilled 15B | T2V, TI2V | 8 | no CFG | 480x256 mp4 (video + audio) | 7 |
| `sr_540p/` | base + SR 540p | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~896x512 mp4 (video + audio) | 20 |
| `sr_1080p/` | base + SR 1080p (block-sparse local-window attention on 32/40 SR DiT layers) | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~1920x1088 mp4 (video + audio) | 15 |

`T2V` = text only. `TI2V` = text + reference image; the image is encoded
through the Wan VAE and stitched into the first video latent frame at every
denoise step (matches upstream `evaluate_with_latent` per-step overwrite).

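As a sketch, the per-step overwrite looks like the following (illustrative only: `denoise_step`, the latent shapes, and the step count are hypothetical stand-ins, not FastVideo's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent shapes: (frames, channels, height, width).
latents = rng.standard_normal((9, 16, 32, 60))   # noisy video latents
ref_latent = rng.standard_normal((16, 32, 60))   # stand-in for the VAE-encoded reference image

def denoise_step(x: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for one scheduler + DiT update (not the real model)."""
    return 0.9 * x + 0.1 * rng.standard_normal(x.shape)

for step in range(8):
    latents = denoise_step(latents, step)
    # Re-stitch the reference latent into the first video frame at every
    # step, so the denoiser is repeatedly conditioned on the clean image.
    latents[0] = ref_latent
```
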
All four DiTs share the same architecture (40 layers, hidden=5120, head_dim=128,
GQA num_query_groups=8); only the weights differ. SR-1080p additionally
restricts video→video attention to a local window of `frame_receptive_field=11`
on 32 of 40 SR DiT layers (matches upstream's `SR2_1080` config override).

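For intuition, a local window of `frame_receptive_field=11` corresponds to a frame-level attention mask like the sketch below (the symmetric centering of the window is an assumption here; FastVideo's actual mask construction may differ):

```python
import numpy as np

def local_window_mask(num_frames: int, receptive_field: int = 11) -> np.ndarray:
    """Boolean frame-to-frame mask: True where attention is allowed.

    Assumes a symmetric window of `receptive_field` frames centered on
    each query frame (radius = (receptive_field - 1) // 2 = 5 here).
    """
    radius = (receptive_field - 1) // 2
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= radius

mask = local_window_mask(24)
# An interior frame (e.g. 12) sees 11 frames: 7..17 inclusive.
```
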
## Quick start

Install [FastVideo](https://github.com/hao-ai-lab/FastVideo) (commit `c05c1048`
or later on the `will/magi` branch, which contains all four variants):

```bash
uv pip install fastvideo
# or pin: uv pip install 'fastvideo @ git+https://github.com/hao-ai-lab/FastVideo@will/magi'
```

Accept the terms on the two gated upstream repos that the pipeline lazy-loads from:

- [`google/t5gemma-9b-9b-ul2`](https://huggingface.co/google/t5gemma-9b-9b-ul2) — text encoder + tokenizer
- [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) — audio VAE

Then export your `HF_TOKEN` and run any of:

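For example (the token value below is a placeholder; generate your own under your Hugging Face account settings):

```shell
# Placeholder token value — substitute your own.
export HF_TOKEN=hf_your_token_here
```

Alternatively, `huggingface-cli login` stores the token in the local HF cache, making the export unnecessary.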
### Base T2V (~5 s on H100)

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
)
generator.generate_video(
    prompt="A warm afternoon scene: a person sits on a park bench reading a book, "
    "surrounded by softly swaying trees.",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```

### Distill T2V (~2 s on H100, no CFG)

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/distill", num_gpus=1)
generator.generate_video(prompt="...", output_path="output.mp4", save_video=True)
generator.shutdown()
```

### Base TI2V (text + reference image)

```python
from fastvideo import VideoGenerator
from fastvideo.pipelines.basic.magi_human.pipeline_configs import MagiHumanBaseI2VConfig

generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
    workload_type="i2v",
    override_pipeline_cls_name="MagiHumanI2VPipeline",
    pipeline_config=MagiHumanBaseI2VConfig(),
)
generator.generate_video(
    prompt="A cheerful saxophonist performs a short line in a small jazz club.",
    image_path="reference.jpg",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```

### Super-resolution (540p)

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_540p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_540p.mp4", save_video=True)
generator.shutdown()
```

### Super-resolution (1080p)

```python
from fastvideo import VideoGenerator

generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_1080p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_1080p.mp4", save_video=True)
generator.shutdown()
```

For fully runnable examples covering all eight (variant × mode) combinations, see
[`examples/inference/basic/basic_magi_human*.py`](https://github.com/hao-ai-lab/FastVideo/tree/will/magi/examples/inference/basic).

## Lazy-load contract — what FastVideo fetches

Each subfolder of this repo ships only variant-specific weights:

```
<subfolder>/
├── model_index.json
├── transformer/      ← variant DiT weights
├── scheduler/        ← FlowUniPCMultistepScheduler config
└── sr_transformer/   ← only in sr_540p/, sr_1080p/
```

The cross-variant shared components (~25 GB total) are **lazy-loaded from
their canonical upstream HF repos** the first time the pipeline runs:

| Component | Source | Gated? |
|---|---|---|
| Wan 2.2 TI2V-5B VAE (video decode) | [`Wan-AI/Wan2.2-TI2V-5B-Diffusers`](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers) | no |
| T5-Gemma 9B encoder + tokenizer | [`google/t5gemma-9b-9b-ul2`](https://huggingface.co/google/t5gemma-9b-9b-ul2) | yes (Google terms of use) |
| Stable Audio Open 1.0 VAE (audio decode) | [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) | yes (Stability AI terms of use) |

Net effect: a user running all four variants downloads ~50 GB of variant
weights plus a single ~25 GB shared cache, totaling ~75 GB instead of ~400 GB if
each variant bundled its own copies.

## Re-converting from raw upstream weights

If you need to re-convert from the raw [GAIR/daVinci-MagiHuman](https://huggingface.co/GAIR/daVinci-MagiHuman) checkpoints:

```bash
# Each variant is converted individually; the umbrella layout is the
# concatenation of these four outputs.
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
    --source GAIR/daVinci-MagiHuman \
    --subfolder base \
    --output local_weights/base

python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
    --source GAIR/daVinci-MagiHuman \
    --subfolder distill \
    --output local_weights/distill \
    --cast-bf16

python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
    --source GAIR/daVinci-MagiHuman \
    --subfolder base \
    --sr-source GAIR/daVinci-MagiHuman --sr-subfolder 540p_sr \
    --output local_weights/sr_540p \
    --cast-bf16

python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
    --source GAIR/daVinci-MagiHuman \
    --subfolder base \
    --sr-source GAIR/daVinci-MagiHuman --sr-subfolder 1080p_sr \
    --output local_weights/sr_1080p \
    --cast-bf16
```

Pass `--bundle-vae` / `--bundle-audio-vae` / `--bundle-text-encoder` if you
want a self-contained snapshot instead of relying on lazy-load.

## Parity vs official daVinci-MagiHuman

All four variants pass FastVideo's local parity battery **bit-exact**
(`diff_max=0.0, diff_mean=0.0`) against the official reference DiT:

| Test | Result |
|---|---|
| `test_magi_human_dit_parity` (base) | bit-exact |
| `test_magi_human_distill_dit_parity` | bit-exact |
| `test_magi_human_pipeline_latent_parity` (base T2V) | bit-exact |
| `test_magi_human_ti2v_pipeline_latent_parity` | bit-exact |
| `test_magi_human_sr540p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_sr1080p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_t5gemma_parity` | bit-exact |
| `test_magi_human_sa_audio_parity` (FV + official) | bit-exact |
| `test_magi_human_vae_parity` (Wan VAE decode) | 8e-4 max (fp32 op-order drift, tracked) |

Block-sparse local-window attention for SR-1080p is implemented as a
3-block accumulator over vanilla SDPA (per-frame video→local-video +
all-video→audio+text + audio+text→all), which mathematically matches
upstream's
[`magi_attention.api.flex_flash_attn_func`](https://github.com/SandAI-org/MagiAttention/blob/main/magi_attention/functional/flex_flash_attn.py)
contract for this 3-block layout, and is verified bit-exact.

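The accumulator idea — computing attention over disjoint key/value blocks and merging the partial results with an online-softmax running max and normalizer — can be illustrated in plain NumPy. This is a generic sketch of blockwise softmax attention, not FastVideo's actual kernel or block layout:

```python
import numpy as np

def softmax_attn(q, k, v):
    """Reference: full softmax attention for one (seq, dim) head."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def blockwise_attn(q, k_blocks, v_blocks):
    """Accumulate attention over disjoint key/value blocks.

    Keeps a running row max `m` and normalizer `l` per query so the
    merged result equals attention over the concatenated keys.
    """
    d = q.shape[-1]
    m = np.full((q.shape[0], 1), -np.inf)   # running row max
    l = np.zeros((q.shape[0], 1))           # running softmax normalizer
    acc = np.zeros((q.shape[0], v_blocks[0].shape[-1]))
    for k, v in zip(k_blocks, v_blocks):
        s = q @ k.T / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)
        scale = np.exp(m - m_new)           # rescale previous partials
        l = l * scale + p.sum(axis=-1, keepdims=True)
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 16))
k = rng.standard_normal((9, 16))
v = rng.standard_normal((9, 16))
# Split keys/values into 3 blocks; the merged result matches full attention.
out = blockwise_attn(q, np.split(k, 3), np.split(v, 3))
```

The same merge rule works for any number of blocks, which is what makes a per-query-group block decomposition (like the 3-block layout above) equivalent to one dense attention call.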
## Citation

```bibtex
@article{davinci-magihuman-2026,
  title   = {Speed by Simplicity: A Single-Stream Architecture for Fast
             Audio-Video Generative Foundation Model},
  author  = {SII-GAIR and Sand.ai},
  journal = {arXiv preprint arXiv:2603.21986},
  year    = {2026}
}

@misc{fastvideo-magihuman-port,
  title        = {{daVinci-MagiHuman} for {FastVideo}},
  author       = {{FastVideo team}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/FastVideo/MagiHuman-Diffusers}}
}
```

## License

Apache 2.0 (matches upstream
[GAIR/daVinci-MagiHuman](https://huggingface.co/GAIR/daVinci-MagiHuman)).

## Acknowledgments

- [SII-GAIR](https://plms.ai) and [Sand.ai](https://sand.ai) for the original
  daVinci-MagiHuman model and inference code.
- [Wan-AI](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B) for the Wan 2.2 video
  VAE.
- [Google](https://huggingface.co/google/t5gemma-9b-9b-ul2) for the T5-Gemma
  text encoder.
- [Stability AI](https://huggingface.co/stabilityai/stable-audio-open-1.0) for
  the Stable Audio Open 1.0 audio VAE.
- [SandAI-org / MagiAttention](https://github.com/SandAI-org/MagiAttention) for
  the canonical FFA / `flex_flash_attn_func` reference.