# HunyuanVideo ModelOpt FP8 SGLang Transformer

This repository contains an SGLang-ready ModelOpt FP8 transformer override for `hunyuanvideo-community/HunyuanVideo`. It replaces only the DiT/transformer weights; the text encoders, VAE, scheduler, tokenizer, and other non-transformer components are loaded from the original base model.

The checkpoint is intended for SGLang Diffusion with the HunyuanVideo FP8 support from sgl-project/sglang#23199.

## Usage

```bash
sglang generate \
  --backend=sglang \
  --model-path hunyuanvideo-community/HunyuanVideo \
  --transformer-path BBuf/HunyuanVideo-ModelOpt-FP8-SGLang \
  --prompt "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." \
  --seed=42 \
  --text-encoder-cpu-offload \
  --pin-cpu-memory \
  --num-frames=65 \
  --fps=13 \
  --width=848 \
  --height=480 \
  --num-inference-steps=30 \
  --save-output \
  --warmup \
  --enable-torch-compile
```

The command above follows the HunyuanVideo preset used by the `sglang-diffusion-benchmark-profile` skill. The `--num-frames=65 --fps=13` pair gives an exactly 5.000 s video.
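The exact-duration claim is plain arithmetic (frames divided by frame rate); a quick sanity check:

```python
# Clip duration implied by the --num-frames and --fps flags above.
num_frames = 65
fps = 13
duration_s = num_frames / fps
print(f"{duration_s:.3f} s")  # 5.000 s
```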

## H100 Validation Snapshot

Validation was run on a single H100 GPU (rank 0, `CUDA_VISIBLE_DEVICES=0`) with `--backend=sglang`. The logs show `Using pipeline from model_index.json: HunyuanVideoPipeline`, and no diffusers fallback markers were observed.

Artifacts:

- BF16 vs FP8 5 s contact sheet

Benchmark (warmup excluded):

| Metric | BF16 | FP8 | Delta | Speedup |
| --- | --- | --- | --- | --- |
| E2E latency | 59.546 s | 54.748 s | -4.798 s (-8.1%) | 1.09x |
| Denoising stage | 42.542 s | 37.980 s | -4.562 s (-10.7%) | 1.12x |
| Avg denoise step | 1.4180 s | 1.2659 s | -0.1521 s | 1.12x |
| Decoding stage | 16.692 s | 16.458 s | -0.233 s (-1.4%) | 1.01x |
| Text encoding | 0.308 s | 0.306 s | -0.002 s (-0.7%) | 1.01x |

Profiler kernel shares over the 5 profiled denoise timesteps are listed below. Profiler timings include profiling overhead and are therefore not used as benchmark latency numbers.

| Precision | Total CUDA op time | Top CUDA/kernel shares |
| --- | --- | --- |
| BF16 | 17.055 s | `cudaMemcpyAsync` 41.54%; FlashAttention 31.99%; BF16 GEMM kernels 9.77%, 8.16%, 2.11% |
| FP8 | 15.324 s | `cudaMemcpyAsync` 40.62%; FlashAttention 36.80%; FP8 Cutlass GEMM 12.83%; `_static_quant_fp8` 1.37% |
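The shares above come from profiler output; the aggregation itself is simple. A minimal sketch of turning per-kernel CUDA times into percentage shares (the kernel names and times in `rows` are illustrative placeholders, not the measured values):

```python
from collections import defaultdict

def kernel_shares(rows):
    """Aggregate (kernel_name, cuda_time_s) rows into percentage shares."""
    totals = defaultdict(float)
    for name, seconds in rows:
        totals[name] += seconds
    grand_total = sum(totals.values())
    return {name: 100.0 * t / grand_total for name, t in totals.items()}

# Illustrative rows only -- not the profiler output reported above.
rows = [
    ("cudaMemcpyAsync", 6.2),
    ("flash_attn_fwd", 5.6),
    ("cutlass_fp8_gemm", 2.0),
    ("_static_quant_fp8", 0.2),
]
shares = kernel_shares(rows)
```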

## Conversion Notes

The checkpoint was converted from a ModelOpt FP8 export with SGLang's `build_modelopt_fp8_transformer` tool using the `hunyuan-video` preset. The preset keeps numerically sensitive embedder, modulation, and output layers in BF16, and maps ModelOpt/diffusers module names to SGLang runtime module names for the fused QKV and fused QKV+MLP projections.
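To illustrate what such a name mapping looks like, here is a hypothetical sketch for the fused-QKV case (this is not the actual `build_modelopt_fp8_transformer` code; the key patterns are assumptions):

```python
import re

# Hypothetical sketch: separate to_q/to_k/to_v projection names are
# addressed under one fused qkv_proj name, plus a slot index telling a
# loader which third of the fused tensor the weight fills.
QKV_SLOT = {"q": 0, "k": 1, "v": 2}

def remap_qkv_name(key: str):
    """Map a diffusers-style attention weight name to a fused-QKV name.

    Returns (fused_name, slot) for to_q/to_k/to_v projections; any other
    key passes through unchanged with slot None.
    """
    m = re.match(r"(.*)\.to_([qkv])\.(weight|bias)$", key)
    if not m:
        return key, None
    prefix, which, suffix = m.groups()
    return f"{prefix}.qkv_proj.{suffix}", QKV_SLOT[which]
```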

One runtime caveat: the CLI can keep the same offload flags as the BF16 skill preset, but ModelOpt FP8 checkpoints currently force `dit_cpu_offload` off so that restored FP8 tensors keep their expected strides; layerwise offload behavior is preserved.
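The caveat amounts to a checkpoint-dependent override of the configured flags. A minimal sketch, assuming hypothetical field names that mirror the CLI flags (not a real SGLang config class):

```python
from dataclasses import dataclass

# Field names below are assumptions for illustration, not SGLang's API.
@dataclass
class OffloadConfig:
    dit_cpu_offload: bool = True
    layerwise_offload: bool = True

def apply_modelopt_fp8_constraints(cfg: OffloadConfig, is_modelopt_fp8: bool) -> OffloadConfig:
    """Force DiT CPU offload off for ModelOpt FP8 checkpoints so restored
    FP8 tensors keep their expected strides; layerwise offload is untouched."""
    if is_modelopt_fp8:
        cfg.dit_cpu_offload = False
    return cfg

cfg = apply_modelopt_fp8_constraints(OffloadConfig(), is_modelopt_fp8=True)
```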
