# HunyuanVideo ModelOpt FP8 SGLang Transformer

This repository contains an SGLang-ready ModelOpt FP8 transformer override for `hunyuanvideo-community/HunyuanVideo`. It replaces only the DiT/transformer weights; the text encoders, VAE, scheduler, tokenizer, and other non-transformer components are loaded from the original base model.

The checkpoint is intended for SGLang Diffusion with the HunyuanVideo FP8 support from sgl-project/sglang#23199.

## Usage

```bash
sglang generate \
  --backend=sglang \
  --model-path hunyuanvideo-community/HunyuanVideo \
  --transformer-path BBuf/HunyuanVideo-ModelOpt-FP8-SGLang \
  --prompt "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window." \
  --seed=42 \
  --text-encoder-cpu-offload \
  --pin-cpu-memory \
  --num-frames=65 \
  --fps=13 \
  --width=848 \
  --height=480 \
  --num-inference-steps=30 \
  --save-output \
  --warmup \
  --enable-torch-compile
```

The command above follows the HunyuanVideo preset used by the `sglang-diffusion-benchmark-profile` skill. The `--num-frames=65 --fps=13` pair gives an exactly 5.000 s video.
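The exact-duration claim is plain arithmetic (frames divided by frame rate); a quick sanity check:

```python
# Clip duration implied by the --num-frames and --fps flags above.
num_frames = 65
fps = 13
duration_s = num_frames / fps
print(f"{duration_s:.3f} s")  # 5.000 s
```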

## H100 Validation Snapshot

Validation was run on a single H100 GPU (rank 0, `CUDA_VISIBLE_DEVICES=0`) with `--backend=sglang`. The logs show `Using pipeline from model_index.json: HunyuanVideoPipeline`, and no diffusers fallback markers were observed.

Artifacts:

- BF16 vs FP8 5 s contact sheet

Benchmark (warmup excluded):

| Metric | BF16 | FP8 | Delta | Speedup |
| --- | --- | --- | --- | --- |
| E2E latency | 59.546 s | 54.748 s | -4.798 s (-8.1%) | 1.09x |
| Denoising stage | 42.542 s | 37.980 s | -4.562 s (-10.7%) | 1.12x |
| Avg denoise step | 1.4180 s | 1.2659 s | -0.1521 s | 1.12x |
| Decoding stage | 16.692 s | 16.458 s | -0.233 s (-1.4%) | 1.01x |
| Text encoding | 0.308 s | 0.306 s | -0.002 s (-0.7%) | 1.01x |

Profiler kernel shares over the 5 profiled denoise timesteps are listed below. Profiler timings include profiling overhead and are therefore not used as benchmark latency numbers.

| Precision | Total CUDA op time | Top CUDA/kernel shares |
| --- | --- | --- |
| BF16 | 17.055 s | `cudaMemcpyAsync` 41.54%; FlashAttention 31.99%; BF16 GEMM kernels 9.77%, 8.16%, 2.11% |
| FP8 | 15.324 s | `cudaMemcpyAsync` 40.62%; FlashAttention 36.80%; FP8 Cutlass GEMM 12.83%; `_static_quant_fp8` 1.37% |
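The shares above come from profiler output; the aggregation itself is simple. A minimal sketch of turning per-kernel CUDA times into percentage shares (the kernel names and times in `rows` are illustrative placeholders, not the measured values):

```python
from collections import defaultdict

def kernel_shares(rows):
    """Aggregate (kernel_name, cuda_time_s) rows into percentage shares."""
    totals = defaultdict(float)
    for name, seconds in rows:
        totals[name] += seconds
    grand_total = sum(totals.values())
    return {name: 100.0 * t / grand_total for name, t in totals.items()}

# Illustrative rows only -- not the profiler output reported above.
rows = [
    ("cudaMemcpyAsync", 6.2),
    ("flash_attn_fwd", 5.6),
    ("cutlass_fp8_gemm", 2.0),
    ("_static_quant_fp8", 0.2),
]
shares = kernel_shares(rows)
```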

## Conversion Notes

The checkpoint was converted from a ModelOpt FP8 export with SGLang's `build_modelopt_fp8_transformer` tool using the `hunyuan-video` preset. The preset keeps numerically sensitive embedder, modulation, and output layers in BF16, and maps ModelOpt/diffusers module names to SGLang runtime module names for the fused QKV and fused QKV+MLP projections.
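To illustrate what such a name mapping looks like, here is a hypothetical sketch for the fused-QKV case (this is not the actual `build_modelopt_fp8_transformer` code; the key patterns are assumptions):

```python
import re

# Hypothetical sketch: separate to_q/to_k/to_v projection names are
# addressed under one fused qkv_proj name, plus a slot index telling a
# loader which third of the fused tensor the weight fills.
QKV_SLOT = {"q": 0, "k": 1, "v": 2}

def remap_qkv_name(key: str):
    """Map a diffusers-style attention weight name to a fused-QKV name.

    Returns (fused_name, slot) for to_q/to_k/to_v projections; any other
    key passes through unchanged with slot None.
    """
    m = re.match(r"(.*)\.to_([qkv])\.(weight|bias)$", key)
    if not m:
        return key, None
    prefix, which, suffix = m.groups()
    return f"{prefix}.qkv_proj.{suffix}", QKV_SLOT[which]
```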

One runtime caveat: the CLI can keep the same offload flags as the BF16 skill preset, but ModelOpt FP8 checkpoints currently force `dit_cpu_offload` off so that restored FP8 tensors keep their expected strides; layerwise offload behavior is preserved.
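The caveat amounts to a checkpoint-dependent override of the configured flags. A minimal sketch, assuming hypothetical field names that mirror the CLI flags (not a real SGLang config class):

```python
from dataclasses import dataclass

# Field names below are assumptions for illustration, not SGLang's API.
@dataclass
class OffloadConfig:
    dit_cpu_offload: bool = True
    layerwise_offload: bool = True

def apply_modelopt_fp8_constraints(cfg: OffloadConfig, is_modelopt_fp8: bool) -> OffloadConfig:
    """Force DiT CPU offload off for ModelOpt FP8 checkpoints so restored
    FP8 tensors keep their expected strides; layerwise offload is untouched."""
    if is_modelopt_fp8:
        cfg.dit_cpu_offload = False
    return cfg

cfg = apply_modelopt_fp8_constraints(OffloadConfig(), is_modelopt_fp8=True)
```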
