How to use from the Diffusers library
pip install -U diffusers transformers accelerate
import torch
from diffusers import DiffusionPipeline

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("InsecureErasure/Z-Image-Turbo-NVFP4", dtype=torch.bfloat16, device_map="cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]

Z-Image Turbo - NVFP4 Mixed-Precision

Surgical mixed-precision quantization of Z-Image Turbo (6B S3-DiT), generated with convert_to_quant.

Formats: NVFP4 (baseline) + MXFP8 (sensitive layers) + BF16 (critical layers).
Size: 4.84 GB (-58% vs BF16).
Inference: ComfyUI + comfy-kitchen, Blackwell GPU (RTX 50xx / B100 / B200).

Also available: MXFP8 uniform quantization (6.23 GB, near-lossless).
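
As a rough sanity check on the sizes above, the storage cost per weight of each format can be estimated from its block layout. The sketch below assumes the usual definitions: NVFP4 stores 4-bit E2M1 values with one 8-bit scale per block of 16, and MXFP8 stores 8-bit values with one 8-bit E8M0 scale per block of 32. Per-tensor scales and file metadata are ignored, and the function and dictionary names are illustrative, not part of convert_to_quant.

```python
def effective_bits(value_bits: int, scale_bits: int, block_size: int) -> float:
    """Average bits stored per weight, amortizing one block scale over the block."""
    return value_bits + scale_bits / block_size


FORMATS = {
    "NVFP4": effective_bits(4, 8, 16),  # assumed: E2M1 values, FP8 scale per 16 weights
    "MXFP8": effective_bits(8, 8, 32),  # assumed: E4M3 values, E8M0 scale per 32 weights
    "BF16": 16.0,
}

for name, bits in FORMATS.items():
    saving = 1.0 - bits / FORMATS["BF16"]
    print(f"{name}: {bits:.2f} bits/weight, {saving:.0%} smaller than BF16")
```

Under these assumptions a uniform NVFP4 file would be roughly 72% smaller than BF16, so the 58% reported here is consistent with part of the model staying in MXFP8 and BF16.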

BF16 vs NVFP4 | NVFP4 vs NVFP4 plus rank 32 LoRA

  • Prompt:
A bust portrait of a woman in her mid-twenties with messy dark hair tied in a loose bun, wearing a worn denim jacket over a gray hoodie.
She is leaning her elbows on a washing machine, her chin resting on her folded hands. Behind her, a row of industrial dryers against a tiled wall,
with one dryer door hanging open. Above the dryers, a handwritten sign taped to the wall says 'OUT OF ORDER' in black marker,
with a small smiley face drawn on it. To her left, a plastic basket overflows with unfolded clothes. To her right, a vending machine glows green,
displaying 'SOAP $1.50' on a small digital screen. The light is cool and buzzing, like fluorescent tubes overhead. She looks tired but amused
with a faint smirk.
  • Sampler/Scheduler: Euler/Simple
  • Steps: 9
  • CFG: 1.0
  • Shift: 3.0
  • Seed: 920698660737993
  • Resolution: 1024 x 1536

Strategy

Uses per-layer sensitivity analysis via quant_probe and the DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect, SVDQuant) to maximize quality-per-byte:

  • ~190 tensors → NVFP4 (4-bit E2M1): baseline for most attention + FF weights
  • ~100 tensors → MXFP8 (8-bit E4M3 + E8M0): attention outputs, gate projections (w1), mid-block adaLN
  • ~20 tensors → BF16: last QKV, late adaLN modulations, refiner outputs
  • ~110 tensors → BF16: norms, biases, embeddings (auto-excluded by --zimage)
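
To make the 4-bit baseline concrete, here is a minimal pure-Python sketch of block-scaled E2M1 rounding. It assumes the standard E2M1 magnitude grid (0, 0.5, 1, 1.5, 2, 3, 4, 6) and a block of 16 values sharing one scale chosen so that the largest magnitude maps to 6. The scale is kept as a plain float rather than being rounded to FP8, and this nearest-value rounding is not the learned rounding that convert_to_quant performs.

```python
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable magnitudes


def quantize_block(block):
    """Round one block to the E2M1 grid with a shared scale; returns (scale, codes)."""
    amax = max(abs(x) for x in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        # Nearest representable magnitude after dividing by the block scale.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Map grid codes back to real values."""
    return [scale * c for c in codes]


block = [0.02 * i - 0.15 for i in range(16)]  # toy weights, not real model data
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.4f} max_abs_error={max_err:.4f}")
```

The worst-case error is half of the widest grid gap (between 4 and 6) times the block scale, which is why tensors with large outliers lose the most precision at 4 bits and are candidates for MXFP8 or BF16.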

MXFP8-protected layers

| Category | Blocks | Layers |
| --- | --- | --- |
| Early attention outputs | 0, 1 | attention.out |
| Selected QKV projections | 10, 16, 26, 27, 28 | attention.qkv |
| Attention outputs | 3, 6, 9, 11–14, 19, 20, 26–29 | attention.out |
| Gate projections (w1) | 3–29 | feed_forward.w1 |
| Mid-block modulations | 16–21 | adaLN_modulation.0 |

BF16-protected layers

| Category | Layers | Reason |
| --- | --- | --- |
| Last QKV | layers.29.attention.qkv | Feeds directly into final_layer; no downstream compensation |
| Late modulations | layers.(22–29).adaLN_modulation.0 | Controls scale/shift of features near output |
| Refiner attention outputs | context_refiner.(0\|1).attention.out | Only 2 refiner blocks; outputs have outsized impact |
| Selected refiner FF | context_refiner.1.w2, noise_refiner.1.{qkv,out,w2} | Critical single-block projections |
| Refiner up-projections | noise_refiner.(0\|1).w3 | Noise refiner w3 expands features → direct output |

Refiner sub-graphs

| Sub-graph | Block 0 | Block 1 |
| --- | --- | --- |
| context_refiner | qkv + w1 + w2 + w3 MXFP8, out BF16 | qkv + w1 + w3 MXFP8, out + w2 BF16 |
| noise_refiner | qkv + out + w1 + w2 MXFP8, w3 BF16 | qkv + out + w2 + w3 BF16, w1 MXFP8 |

Generation

#!/bin/bash
# NVFP4 baseline + MXFP8 for sensitive layers + BF16 at critical points.
# Refiners: block 0 mostly MXFP8 (context attention.out and noise w3 stay BF16);
# block 1 outputs kept in BF16.
# Last QKV (layer 29), late adaLN (22-29), and refiner outputs in BF16.
# Main-trunk w1 (gate) projections in blocks 3-29 in MXFP8.
# Usage: pass the source .safetensors checkpoint as the first argument.
convert_to_quant -i "$1" \
  --nvfp4 --zimage --comfy_quant --save-quant-metadata \
  --custom-type mxfp8 \
  --custom-layers "layers\.(10|16|26)\.attention\.qkv\.weight|layers\.(27|28)\.attention\.qkv\.weight|layers\.(0|1)\.attention\.out\.weight|layers\.(3|6|9|11|12|13|14|19|20|26)\.attention\.out\.weight|layers\.(27|28|29)\.attention\.out\.weight|layers\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26)\.feed_forward\.w1\.weight|layers\.(27|28|29)\.feed_forward\.w1\.weight|layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.qkv\.weight|context_refiner\.(0|1)\.feed_forward\.w1\.weight|context_refiner\.(0|1)\.feed_forward\.w2\.weight|context_refiner\.(0|1)\.feed_forward\.w3\.weight|noise_refiner\.(0)\.attention\.(qkv|out)\.weight|noise_refiner\.(0)\.feed_forward\.(w1|w2)\.weight|noise_refiner\.(1)\.feed_forward\.w1\.weight" \
  --exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
  --num-iter 6000 --top-p 0.35 --calib-samples 8192 \
  --scale-optimization iterative --scale-refinement-rounds 2 \
  --extract-lora --lora-rank 32 \
  -o "${1%%.safetensors}-nvfp4.safetensors"
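
Before launching a long conversion, it can help to check offline which format a given tensor name would receive. The sketch below copies the alternatives from the two regexes in the script and classifies names with Python's re module. It assumes that --exclude-layers takes precedence over --custom-layers and that patterns are applied with search semantics; both are assumptions about convert_to_quant, and the tensor names in the example are illustrative.

```python
import re

# Alternatives copied from --custom-layers (MXFP8) in the script above.
MXFP8 = re.compile("|".join([
    r"layers\.(10|16|26)\.attention\.qkv\.weight",
    r"layers\.(27|28)\.attention\.qkv\.weight",
    r"layers\.(0|1)\.attention\.out\.weight",
    r"layers\.(3|6|9|11|12|13|14|19|20|26)\.attention\.out\.weight",
    r"layers\.(27|28|29)\.attention\.out\.weight",
    r"layers\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26)\.feed_forward\.w1\.weight",
    r"layers\.(27|28|29)\.feed_forward\.w1\.weight",
    r"layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.qkv\.weight",
    r"context_refiner\.(0|1)\.feed_forward\.w1\.weight",
    r"context_refiner\.(0|1)\.feed_forward\.w2\.weight",
    r"context_refiner\.(0|1)\.feed_forward\.w3\.weight",
    r"noise_refiner\.(0)\.attention\.(qkv|out)\.weight",
    r"noise_refiner\.(0)\.feed_forward\.(w1|w2)\.weight",
    r"noise_refiner\.(1)\.feed_forward\.w1\.weight",
]))

# Alternatives copied from --exclude-layers (kept in BF16).
BF16 = re.compile("|".join([
    r"layers\.(29)\.attention\.qkv\.weight",
    r"layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight",
    r"layers\.(27|28|29)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.out\.weight",
    r"context_refiner\.(1)\.feed_forward\.w2\.weight",
    r"noise_refiner\.(1)\.attention\.qkv\.weight",
    r"noise_refiner\.(1)\.attention\.out\.weight",
    r"noise_refiner\.(1)\.feed_forward\.w2\.weight",
    r"noise_refiner\.(0|1)\.feed_forward\.w3\.weight",
]))


def classify(name: str) -> str:
    """Format for a weight tensor; ignores tensors auto-excluded by --zimage."""
    if BF16.search(name):  # assumption: exclusions win over custom layers
        return "BF16"
    if MXFP8.search(name):
        return "MXFP8"
    return "NVFP4"


for name in ("layers.29.attention.qkv.weight",
             "layers.16.adaLN_modulation.0.weight",
             "layers.5.attention.qkv.weight"):
    print(name, "->", classify(name))
```

Note that context_refiner.1.feed_forward.w2.weight matches both regexes; under the precedence assumed here it resolves to BF16, which agrees with the refiner sub-graph table.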

Included files

| File | Description |
| --- | --- |
| z_image_turbo_nvfp4.safetensors | Quantized weights |
| z_image_turbo_nvfp4_lora.safetensors | Error-correction LoRA (rank 32) |

Use the LoRA with variable strength in ComfyUI for improved fidelity.
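
One common way to build and apply such an error-correction LoRA is sketched below; whether convert_to_quant uses exactly this construction is an assumption. Here $W$ is an original weight, $Q(W)$ its dequantized quantized version, $r = 32$ the rank, and $s$ the LoRA strength set in ComfyUI.

```latex
\begin{aligned}
E &= W - Q(W) = U \Sigma V^{\top} \\
B &= U_{:,1:r}\,\Sigma_{1:r,1:r}^{1/2}, \qquad A = \Sigma_{1:r,1:r}^{1/2}\,V_{:,1:r}^{\top} \\
W_{\text{eff}}(s) &= Q(W) + s\,B A
\end{aligned}
```

With $s = 1$ the rank-$r$ part of the quantization error is restored in full, and $s = 0$ reduces to the plain quantized weights.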

Requirements

  • Inference: CUDA 13.0+, PyTorch 2.10+, comfy-kitchen, Blackwell GPU (RTX 50xx / B100 / B200)
  • Generation: convert_to_quant >= 1.2.6, comfy-kitchen
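
A quick way to check an environment against the inference requirements is sketched below. It assumes that Blackwell GPUs report a CUDA compute capability with major version 10 or higher (for example 10.0 for B200 and 12.0 for RTX 50xx) and that PyTorch is importable. The helper names are illustrative, and the check does not cover comfy-kitchen.

```python
import re


def version_tuple(text):
    """Leading numeric components of a version string, e.g. '2.10.0+cu130' -> (2, 10, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text or "")
    return tuple(int(p) for p in match.group(0).split(".")) if match else ()


def meets_requirements(torch_version, cuda_version, capability):
    """True if versions and compute capability satisfy the list above (assumed thresholds)."""
    return (
        version_tuple(torch_version) >= (2, 10)
        and version_tuple(cuda_version) >= (13, 0)
        and capability[0] >= 10  # assumption: Blackwell is compute capability 10.x or newer
    )


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability()
            print(meets_requirements(torch.__version__, torch.version.cuda, cap))
        else:
            print("No CUDA device visible")
```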

Comparison

| | NVFP4 Mixed (this) | MXFP8 Uniform | Official NVFP4 |
| --- | --- | --- | --- |
| Size | 4.84 GB | 6.23 GB | 4.51 GB |
| Base format | NVFP4 (4-bit) | MXFP8 (8-bit) | NVFP4 (4-bit) |
| Custom layers | ~100 tensors → MXFP8 | None | None |
| BF16 exclusions | ~20 tensors | 8 patterns | Refiners fully BF16 |
| Learned rounding | ✅ 6000 iter | ❌ --simple | ❌ |
| LoRA | ✅ rank 32 | ❌ | ❌ |
| Refiner block 0 | MXFP8 | MXFP8 | BF16 |
| Late adaLN (22–29) | BF16 | BF16 | NVFP4 ⚠️ |
| Last QKV (layer 29) | BF16 | BF16 | NVFP4 ⚠️ |
| Quantization time¹ | ~60–90 min | ~5–10 min | N/A |

¹ Estimated on RTX 5060 (Blackwell) with comfy-kitchen CUDA kernels.

Methodology

Layer sensitivity was analyzed using quant_probe, which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend *KEEP*, FP8, or NVFP4.
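
The three statistics named above can be computed as in the following sketch. It is not quant_probe's code: the exact definitions used here (population excess kurtosis, dynamic range as largest over smallest nonzero magnitude, aspect ratio as longer over shorter dimension) are assumptions, and the scoring step against the model-wide distribution is omitted.

```python
def tensor_stats(rows):
    """Return (excess_kurtosis, dynamic_range, aspect_ratio) for a 2-D list of floats."""
    values = [x for row in rows for x in row]
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    excess_kurtosis = m4 / (m2 * m2) - 3.0 if m2 > 0 else 0.0

    magnitudes = [abs(x) for x in values if x != 0]
    dynamic_range = max(magnitudes) / min(magnitudes) if magnitudes else 1.0

    n_rows, n_cols = len(rows), len(rows[0])
    aspect_ratio = max(n_rows, n_cols) / min(n_rows, n_cols)
    return excess_kurtosis, dynamic_range, aspect_ratio


# Toy weight with one outlier: heavy tails raise the excess kurtosis.
weight = [[0.1, -0.1, 0.1, -0.1], [0.1, -0.1, 0.1, 8.0]]
print(tensor_stats(weight))
```

A tensor with high excess kurtosis or a wide dynamic range has outliers that a shared 4-bit block scale represents poorly, which is the intuition behind keeping such tensors in a wider format.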

Recommendations were cross-referenced against the DiT quantization literature:

  • PTQ4DiT (NeurIPS 2024) - salient channels in QKV + FFN, last blocks most affected
  • ViDiT-Q (ICLR 2025) - metric-decoupled sensitivity: self-attention dominates visual quality
  • HTG (2025) - channel-dependent outliers, severe in later blocks
  • SemanticDialect (2026) - block-wise mixed-format validated for video DiTs
  • SVDQuant (ICLR 2025) - low-rank branch absorbs 4-bit error, validated NVFP4

Credits
