HunyuanVideo ModelOpt FP8 SGLang Transformer
This repository contains an SGLang-ready ModelOpt FP8 transformer override for hunyuanvideo-community/HunyuanVideo.
It only replaces the DiT/transformer weights; text encoders, VAE, scheduler, tokenizer, and other non-transformer components are loaded from the original base model.
The checkpoint is intended for SGLang Diffusion with the HunyuanVideo FP8 support from sgl-project/sglang#23199.
Usage
sglang generate \
--backend=sglang \
--model-path hunyuanvideo-community/HunyuanVideo \
--transformer-path BBuf/HunyuanVideo-ModelOpt-FP8-SGLang \
--prompt "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." \
--seed=42 \
--text-encoder-cpu-offload \
--pin-cpu-memory \
--num-frames=65 \
--fps=13 \
--width=848 \
--height=480 \
--num-inference-steps=30 \
--save-output \
--warmup \
--enable-torch-compile
The command above follows the HunyuanVideo preset used by the sglang-diffusion-benchmark-profile skill. With --num-frames=65 and --fps=13, the output is exactly 5 s long (65 frames / 13 fps).
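The clip length follows directly from the frame count and frame rate. The sketch below is only a sanity check for picking other flag combinations; `clip_duration_seconds` is a hypothetical helper, not part of SGLang.

```python
from fractions import Fraction


def clip_duration_seconds(num_frames: int, fps: int) -> Fraction:
    """Return the clip length in seconds as an exact fraction."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return Fraction(num_frames, fps)


# Values taken from the command above: --num-frames=65 --fps=13
duration = clip_duration_seconds(65, 13)
print(f"{float(duration):.3f} s")
```

Using `Fraction` keeps the result exact, so a pair that does not divide evenly is easy to spot by checking whether `duration.denominator` is 1.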
H100 Validation Snapshot
Validation was run on a single H100 GPU (rank 0, CUDA_VISIBLE_DEVICES=0) with --backend=sglang. The logs show "Using pipeline from model_index.json: HunyuanVideoPipeline", and no diffusers fallback markers were observed.
Artifacts:
- Validation tree: validation/h100_skill_5s_20260420
- Full command and run summary: result_summary_skill_5s.md
- BF16 video: hunyuanvideo_bf16_skill_5s.mp4
- FP8 video: hunyuanvideo_fp8_skill_5s.mp4
- Profiler traces: BF16, FP8, kernel summary
Benchmark, warmup excluded:
| Metric | BF16 | FP8 | Delta | Speedup |
|---|---|---|---|---|
| E2E latency | 59.546 s | 54.748 s | -4.798 s (-8.1%) | 1.09x |
| Denoising stage | 42.542 s | 37.980 s | -4.562 s (-10.7%) | 1.12x |
| Avg denoise step | 1.4180 s | 1.2659 s | -0.1521 s (-10.7%) | 1.12x |
| Decoding stage | 16.692 s | 16.458 s | -0.233 s (-1.4%) | 1.01x |
| Text encoding | 0.308 s | 0.306 s | -0.002 s (-0.7%) | 1.01x |
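The Delta and Speedup columns can be recomputed from the BF16 and FP8 latencies. The following is a minimal sketch of that arithmetic, assuming delta is FP8 minus BF16 and speedup is BF16 divided by FP8; `compare_latency` is a hypothetical helper name, not part of the benchmark tooling.

```python
def compare_latency(bf16_s: float, fp8_s: float) -> dict:
    """Derive delta, relative change, and speedup from two latencies in seconds."""
    delta = fp8_s - bf16_s  # negative means FP8 is faster
    return {
        "delta_s": round(delta, 3),
        "delta_pct": round(100.0 * delta / bf16_s, 1),
        "speedup": round(bf16_s / fp8_s, 2),
    }


# Latencies copied from the table above
e2e = compare_latency(59.546, 54.748)
denoise = compare_latency(42.542, 37.980)
print("E2E:", e2e)
print("Denoising:", denoise)
```

Small last-digit differences against the table are possible for other rows, because the table was presumably derived from unrounded timings.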
Profiler kernel share over 5 profiled denoise timesteps. Profiler timings include profiling overhead and are not used as benchmark latency numbers.
| Precision | Total CUDA op time | Top CUDA/kernel shares |
|---|---|---|
| BF16 | 17.055 s | cudaMemcpyAsync 41.54%; FlashAttention 31.99%; BF16 GEMM kernels 9.77%, 8.16%, 2.11% |
| FP8 | 15.324 s | cudaMemcpyAsync 40.62%; FlashAttention 36.80%; FP8 Cutlass GEMM 12.83%; _static_quant_fp8 1.37% |
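Percent shares can be misleading when the totals differ, so it can help to convert them to approximate absolute times. The sketch below multiplies each listed share by the total CUDA op time from the table. The category grouping (summing the three BF16 GEMM entries) is an assumption made for illustration, and the results still include profiling overhead.

```python
# Shares (in percent) and totals copied from the profiler table above.
PROFILE = {
    "BF16": {
        "total_s": 17.055,
        "memcpy": 41.54,
        "flash_attention": 31.99,
        "gemm": 9.77 + 8.16 + 2.11,  # three listed BF16 GEMM kernels, summed
    },
    "FP8": {
        "total_s": 15.324,
        "memcpy": 40.62,
        "flash_attention": 36.80,
        "gemm": 12.83,  # FP8 Cutlass GEMM only, excludes _static_quant_fp8
    },
}


def absolute_seconds(precision: str, category: str) -> float:
    """Convert a percent share into approximate seconds of CUDA op time."""
    entry = PROFILE[precision]
    return entry["total_s"] * entry[category] / 100.0


for precision in PROFILE:
    for category in ("memcpy", "flash_attention", "gemm"):
        print(f"{precision:5s} {category:16s} {absolute_seconds(precision, category):.2f} s")
```

Under these numbers, FlashAttention takes roughly the same absolute time in both runs (about 5.5 s versus 5.6 s); its share rises mainly because the total shrinks. The listed GEMM time drops from about 3.4 s to about 2.0 s.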
Conversion Notes
The checkpoint was converted from a ModelOpt FP8 export with SGLang's build_modelopt_fp8_transformer tool using the hunyuan-video preset.
The preset keeps numerically sensitive embedder, modulation, and output layers in BF16, and maps ModelOpt/diffusers module names to SGLang runtime module names for fused QKV and fused QKV+MLP projections.
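As an illustration of the "keep sensitive layers in BF16" rule, the sketch below selects a target precision per module by name pattern. It is not the preset's actual implementation: the patterns and module names are hypothetical placeholders, and the real list lives in SGLang's conversion tool.

```python
from fnmatch import fnmatch

# Hypothetical patterns for illustration only; the real preset's list is not reproduced here.
KEEP_BF16_PATTERNS = ["*embedder*", "*modulation*", "*proj_out*"]


def target_dtype(module_name: str) -> str:
    """Return 'bf16' for modules matching a keep-list pattern, else 'fp8'."""
    if any(fnmatch(module_name, pattern) for pattern in KEEP_BF16_PATTERNS):
        return "bf16"
    return "fp8"


# Hypothetical module names, chosen only to exercise both branches
for name in ("x_embedder.proj", "blocks.0.attn.to_qkv", "proj_out"):
    print(name, "->", target_dtype(name))
```

A keep-list of name patterns is one simple way to express such a policy, since it lets the exclusion rules change without touching the conversion loop.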
One runtime caveat: the CLI can keep the same offload flags as the BF16 skill preset, but ModelOpt FP8 checkpoints currently force dit_cpu_offload off while preserving layerwise offload behavior for restored FP8 tensor strides.