---
license: apache-2.0
library_name: fastvideo
pipeline_tag: text-to-video
tags:
- text-to-video
- image-to-video
- text-to-audio
- super-resolution
- magi-human
- davinci
- sii-gair
- sand-ai
- audio-visual-generation
- multimodal
base_model:
- GAIR/daVinci-MagiHuman
---
# daVinci-MagiHuman — FastVideo Diffusers port
A FastVideo-format port of [SII-GAIR + Sand.ai's daVinci-MagiHuman](https://github.com/GAIR-NLP/daVinci-MagiHuman) joint
audio-visual generative model: a single repo with four sibling subfolders, one umbrella
HF model string per variant, and all four variants bit-exact against the official reference.
> 15B-parameter single-stream transformer that jointly denoises video + audio in
> a unified token sequence. Generates a 5-second 256p clip with synchronized
> audio in ~2 s on a single H100. See the
> [paper](https://arxiv.org/abs/2603.21986) and the
> [official repo](https://github.com/GAIR-NLP/daVinci-MagiHuman).
## Variant matrix
| Subfolder | Model | Inference modes | Steps | CFG | Output | DiT files |
|---|---|---|---|---|---|---|
| `base/` | base 15B | T2V, TI2V | 32 | CFG=2 | 480x256 mp4 (video + audio) | 7 |
| `distill/` | DMD-2 distilled 15B | T2V, TI2V | 8 | no CFG | 480x256 mp4 (video + audio) | 7 |
| `sr_540p/` | base + SR 540p | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~896x512 mp4 (video + audio) | 20 |
| `sr_1080p/` | base + SR 1080p (block-sparse local-window attention on 32/40 SR DiT layers) | T2V, TI2V | 32 + 5 | CFG=2 + SR cfg-trick | ~1920x1088 mp4 (video + audio) | 15 |
`T2V` = text only. `TI2V` = text + reference image; the image is encoded
through the Wan VAE and stitched into the first video latent frame at every
denoise step (matches upstream `evaluate_with_latent` per-step overwrite).
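For intuition, here is a minimal sketch of what that per-step overwrite amounts to, assuming a `[B, C, T, H, W]` latent layout (the function name and signature are illustrative, not the pipeline's actual API):
```python
import torch

def overwrite_reference_latent(latents: torch.Tensor,
                               ref_latent: torch.Tensor) -> torch.Tensor:
    # latents:    [B, C, T, H, W] video latents mid-denoising
    # ref_latent: [B, C, 1, H, W] Wan-VAE encoding of the reference image
    # Re-stitching after *every* scheduler step keeps later steps from
    # drifting away from the reference image.
    latents[:, :, :1] = ref_latent
    return latents
```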
All four DiTs share the same architecture (40 layers, hidden=5120, head_dim=128,
GQA num_query_groups=8); only the weights differ. SR-1080p additionally
restricts video→video attention to a local window of `frame_receptive_field=11`
on 32 of 40 SR DiT layers (matches upstream's `SR2_1080` config override).
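As a rough illustration of that restriction (a sketch only: the exact window semantics and token ordering follow the upstream `SR2_1080` config, not this snippet), a centered frame-local mask over the video tokens could be built like this:
```python
import torch

def frame_local_video_mask(num_frames: int, tokens_per_frame: int,
                           frame_receptive_field: int = 11) -> torch.Tensor:
    # Boolean [Nv, Nv] mask over video tokens: True where attention is
    # allowed. Each token may attend only to tokens whose frame index
    # falls inside a centered window of `frame_receptive_field` frames.
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    dist = (frame_idx[:, None] - frame_idx[None, :]).abs()
    return dist <= frame_receptive_field // 2
```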
## Quick start
Install [FastVideo](https://github.com/hao-ai-lab/FastVideo) (the `will/magi`
branch at commit `c05c1048` or later contains all four variants):
```bash
uv pip install fastvideo
# or pin: uv pip install 'fastvideo @ git+https://github.com/hao-ai-lab/FastVideo@will/magi'
```
Accept terms on the two gated upstream repos that the pipeline lazy-loads from:
- [`google/t5gemma-9b-9b-ul2`](https://huggingface.co/google/t5gemma-9b-9b-ul2) — text encoder + tokenizer
- [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) — audio VAE
Then export your `HF_TOKEN` and run any of:
### Base T2V (~5 s on H100)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
)
generator.generate_video(
    prompt="A warm afternoon scene: a person sits on a park bench reading a book, "
           "surrounded by softly swaying trees.",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```
### Distill T2V (~2 s on H100, no CFG)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/distill", num_gpus=1)
generator.generate_video(prompt="...", output_path="output.mp4", save_video=True)
generator.shutdown()
```
### Base TI2V (text + reference image)
```python
from fastvideo import VideoGenerator
from fastvideo.pipelines.basic.magi_human.pipeline_configs import MagiHumanBaseI2VConfig
generator = VideoGenerator.from_pretrained(
    "FastVideo/MagiHuman-Diffusers/base",
    num_gpus=1,
    workload_type="i2v",
    override_pipeline_cls_name="MagiHumanI2VPipeline",
    pipeline_config=MagiHumanBaseI2VConfig(),
)
generator.generate_video(
    prompt="A cheerful saxophonist performs a short line in a small jazz club.",
    image_path="reference.jpg",
    output_path="output.mp4",
    save_video=True,
)
generator.shutdown()
```
### Super-resolution (540p)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_540p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_540p.mp4", save_video=True)
generator.shutdown()
```
### Super-resolution (1080p)
```python
from fastvideo import VideoGenerator
generator = VideoGenerator.from_pretrained("FastVideo/MagiHuman-Diffusers/sr_1080p", num_gpus=1)
generator.generate_video(prompt="...", output_path="output_1080p.mp4", save_video=True)
generator.shutdown()
```
For full runnable examples for all eight (variant × mode) combinations see
[`examples/inference/basic/basic_magi_human*.py`](https://github.com/hao-ai-lab/FastVideo/tree/will/magi/examples/inference/basic).
## Lazy-load contract — what FastVideo fetches
Each subfolder of this repo ships only variant-specific weights:
```
<subfolder>/
├── model_index.json
├── transformer/ ← variant DiT weights
├── scheduler/ ← FlowUniPCMultistepScheduler config
└── sr_transformer/ ← only in sr_540p/, sr_1080p/
```
The cross-variant shared components (~25 GB total) are **lazy-loaded from
their canonical upstream HF repos** the first time the pipeline runs:
| Component | Source | Gated? |
|---|---|---|
| Wan 2.2 TI2V-5B VAE (video decode) | [`Wan-AI/Wan2.2-TI2V-5B-Diffusers`](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers) | no |
| T5-Gemma 9B encoder + tokenizer | [`google/t5gemma-9b-9b-ul2`](https://huggingface.co/google/t5gemma-9b-9b-ul2) | yes (Google terms of use) |
| Stable Audio Open 1.0 VAE (audio decode) | [`stabilityai/stable-audio-open-1.0`](https://huggingface.co/stabilityai/stable-audio-open-1.0) | yes (Stability AI terms of use) |
Net effect: a user running all four variants downloads ~50 GB of variant
weights + a single ~25 GB shared cache, totaling ~75 GB instead of ~400 GB if
each variant bundled its own copies.
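If you want to warm the cache before the first run, the shared repos can be prefetched with the standard hub API (plain `huggingface_hub` usage, not a FastVideo feature; the gated repos still require an accepted license and a valid `HF_TOKEN`):
```python
from huggingface_hub import snapshot_download

# Prefetch the cross-variant shared components into the local HF cache so
# the pipeline's lazy-load hits a warm cache on first run.
for repo_id in ("Wan-AI/Wan2.2-TI2V-5B-Diffusers",
                "google/t5gemma-9b-9b-ul2",
                "stabilityai/stable-audio-open-1.0"):
    snapshot_download(repo_id)
```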
## Re-converting from raw upstream weights
If you need to re-convert from the raw [GAIR/daVinci-MagiHuman](https://huggingface.co/GAIR/daVinci-MagiHuman):
```bash
# Each variant is converted individually; the umbrella layout is the
# concatenation of these four outputs.
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder base \
--output local_weights/base
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder distill \
--output local_weights/distill \
--cast-bf16
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder base \
--sr-source GAIR/daVinci-MagiHuman --sr-subfolder 540p_sr \
--output local_weights/sr_540p \
--cast-bf16
python scripts/checkpoint_conversion/convert_magi_human_to_diffusers.py \
--source GAIR/daVinci-MagiHuman \
--subfolder base \
--sr-source GAIR/daVinci-MagiHuman --sr-subfolder 1080p_sr \
--output local_weights/sr_1080p \
--cast-bf16
```
Pass `--bundle-vae` / `--bundle-audio-vae` / `--bundle-text-encoder` if you
want a self-contained snapshot instead of relying on lazy-load.
## Parity vs official daVinci-MagiHuman
All four variants pass FastVideo's local parity test battery **bit-exact**
(`diff_max=0.0, diff_mean=0.0`) against the official reference DiT:
| Test | Result |
|---|---|
| `test_magi_human_dit_parity` (base) | bit-exact |
| `test_magi_human_distill_dit_parity` | bit-exact |
| `test_magi_human_pipeline_latent_parity` (base T2V) | bit-exact |
| `test_magi_human_ti2v_pipeline_latent_parity` | bit-exact |
| `test_magi_human_sr540p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_sr1080p_pipeline_latent_parity[t2v / ti2v]` | bit-exact |
| `test_magi_human_t5gemma_parity` | bit-exact |
| `test_magi_human_sa_audio_parity` (FV + official) | bit-exact |
| `test_magi_human_vae_parity` (Wan VAE decode) | 8e-4 max (fp32 op-order drift, tracked) |
Block-sparse local-window attention for SR-1080p is implemented as a
3-block accumulator over vanilla SDPA (per-frame video→local-video +
all-video→audio+text + audio+text→all), which mathematically matches
upstream's
[`magi_attention.api.flex_flash_attn_func`](https://github.com/SandAI-org/MagiAttention/blob/main/magi_attention/functional/flex_flash_attn.py)
contract for this 3-block layout. Bit-exact verified.
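For readers unfamiliar with the trick: attention over a union of disjoint key blocks can be reconstructed exactly from per-block outputs plus their per-query log-sum-exps, the same renormalization flash attention applies per tile. A minimal PyTorch sketch of that merge rule (illustrative only, not the pipeline's kernel):
```python
import torch

def attn_with_lse(q, k, v):
    # Vanilla attention that also returns the per-query log-sum-exp,
    # which is what makes exact cross-block merging possible.
    scores = torch.einsum("...qd,...kd->...qk", q, k) * q.shape[-1] ** -0.5
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)  # [..., Q, 1]
    return torch.einsum("...qk,...kd->...qd", scores.softmax(-1), v), lse

def merge_partials(o1, lse1, o2, lse2):
    # Exact softmax renormalization of two partials computed over disjoint
    # key sets for the same queries; associative, so any number of blocks
    # folds the same way.
    lse = torch.logaddexp(lse1, lse2)
    return torch.exp(lse1 - lse) * o1 + torch.exp(lse2 - lse) * o2, lse
```
Under this reading of the 3-block layout, video queries merge two partials (local-video keys and audio+text keys), while audio and text queries attend to the full sequence and need no merge.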
## Citation
```bibtex
@article{davinci-magihuman-2026,
  title   = {Speed by Simplicity: A Single-Stream Architecture for Fast
             Audio-Video Generative Foundation Model},
  author  = {SII-GAIR and Sand.ai},
  journal = {arXiv preprint arXiv:2603.21986},
  year    = {2026}
}
@misc{fastvideo-magihuman-port,
  title        = {{daVinci-MagiHuman} for {FastVideo}},
  author       = {{FastVideo team}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/FastVideo/MagiHuman-Diffusers}}
}
```
## License
Apache 2.0 (matches upstream
[GAIR/daVinci-MagiHuman](https://huggingface.co/GAIR/daVinci-MagiHuman)).
## Acknowledgments
- [SII-GAIR](https://plms.ai) and [Sand.ai](https://sand.ai) for the original
daVinci-MagiHuman model and inference code.
- [Wan-AI](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B) for the Wan 2.2 video
VAE.
- [Google](https://huggingface.co/google/t5gemma-9b-9b-ul2) for the T5-Gemma
text encoder.
- [Stability AI](https://huggingface.co/stabilityai/stable-audio-open-1.0) for
the Stable Audio Open 1.0 audio VAE.
- [SandAI-org / MagiAttention](https://github.com/SandAI-org/MagiAttention) for
the canonical FFA / `flex_flash_attn_func` reference.