Z-Image Turbo - NVFP4 Mixed-Precision
Surgical mixed-precision quantization of Z-Image Turbo (6B S3-DiT), generated with convert_to_quant.
Formats: NVFP4 (baseline) + MXFP8 (sensitive layers) + BF16 (critical layers).
Size: 4.84 GB (-58% vs BF16).
Inference: ComfyUI + comfy-kitchen, Blackwell GPU (RTX 50xx / B100 / B200).
Also available: MXFP8 uniform quantization (6.23 GB, near-lossless).
- Prompt:
A bust portrait of a woman in her mid-twenties with messy dark hair tied in a loose bun, wearing a worn denim jacket over a gray hoodie.
She is leaning her elbows on a washing machine, her chin resting on her folded hands. Behind her, a row of industrial dryers against a tiled wall,
with one dryer door hanging open. Above the dryers, a handwritten sign taped to the wall says 'OUT OF ORDER' in black marker,
with a small smiley face drawn on it. To her left, a plastic basket overflows with unfolded clothes. To her right, a vending machine glows green,
displaying 'SOAP $1.50' on a small digital screen. The light is cool and buzzing, like fluorescent tubes overhead. She looks tired but amused
with a faint smirk.
- Sampler/Scheduler: Euler/Simple
- Steps: 9
- CFG: 1.0
- Shift: 3.0
- Seed: 920698660737993
- Resolution: 1024 x 1536
Strategy
Uses per-layer sensitivity analysis via quant_probe and the DiT quantization literature (PTQ4DiT, ViDiT-Q, SemanticDialect, SVDQuant) to maximize quality-per-byte:
- ~190 tensors → NVFP4 (4-bit E2M1): baseline for most attention + FF weights
- ~100 tensors → MXFP8 (8-bit E4M3 + E8M0): attention outputs, gate projections (w1), mid-block adaLN
- ~20 tensors → BF16: last QKV, late adaLN modulations, refiner outputs
- ~110 tensors → BF16: norms, biases, embeddings (auto-excluded by --zimage)
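To see where the size reduction comes from, the sketch below estimates the storage cost per weight for each format. It assumes the standard block layouts (NVFP4: blocks of 16 values sharing one 8-bit E4M3 scale; MXFP8: blocks of 32 values sharing one 8-bit E8M0 scale) and ignores per-tensor scales and file metadata. The helper name `mixed_size_gb` and the parameter split are hypothetical and are not measured from this checkpoint.

```python
# Effective storage cost per weight, in bits, for each format.
# Assumes the standard NVFP4 / MX block layouts; per-tensor scales
# and safetensors metadata are ignored.
BITS_PER_WEIGHT = {
    "nvfp4": 4 + 8 / 16,  # 4-bit E2M1 value + one 8-bit E4M3 scale per 16 values
    "mxfp8": 8 + 8 / 32,  # 8-bit E4M3 value + one 8-bit E8M0 scale per 32 values
    "bf16": 16.0,
}


def mixed_size_gb(params_by_format):
    """Estimate checkpoint size in GB (1e9 bytes) from {format: parameter count}."""
    total_bits = sum(BITS_PER_WEIGHT[fmt] * n for fmt, n in params_by_format.items())
    return total_bits / 8 / 1e9


# Hypothetical split of a 6B-parameter model, for illustration only.
example_split = {"nvfp4": 4.5e9, "mxfp8": 1.0e9, "bf16": 0.5e9}
print(f"{mixed_size_gb(example_split):.2f} GB")
```

The point of the estimate is that moving a tensor from NVFP4 to MXFP8 costs roughly 3.75 extra bits per weight, and moving it to BF16 costs 11.5, which is why only the most sensitive layers are promoted.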
MXFP8-protected layers
| Category | Blocks | Layers |
|---|---|---|
| Early attention outputs | 0, 1 | attention.out |
| Selected QKV projections | 10, 16, 26, 27, 28 | attention.qkv |
| Attention outputs | 3, 6, 9, 11–14, 19, 20, 26–29 | attention.out |
| Gate projections (w1) | 3–29 | feed_forward.w1 |
| Mid-block modulations | 16–21 | adaLN_modulation.0 |
BF16-protected layers
| Category | Layers | Reason |
|---|---|---|
| Last QKV | layers.29.attention.qkv | Feeds directly into final_layer; no downstream compensation |
| Late modulations | layers.(22–29).adaLN_modulation.0 | Controls scale/shift of features near output |
| Refiner attention outputs | context_refiner.(0\|1).attention.out | Only 2 refiner blocks, so outputs have outsized impact |
| Selected refiner FF | context_refiner.1.w2, noise_refiner.1.{qkv,out,w2} | Critical single-block projections |
| Refiner up-projections | noise_refiner.(0\|1).w3 | Noise refiner w3 expands features; direct output |
Refiner sub-graphs
| Sub-graph | Block 0 | Block 1 |
|---|---|---|
| context_refiner | qkv + w1 + w2 + w3 MXFP8, out BF16 | qkv + w1 + w3 MXFP8, out + w2 BF16 |
| noise_refiner | qkv + out + w1 + w2 MXFP8, w3 BF16 | qkv + out + w2 + w3 BF16, w1 MXFP8 |
Generation
#!/bin/bash
# NVFP4 baseline + MXFP8 for sensitive layers + BF16 at critical points.
# Refiners: block 0 fully MXFP8, block 1 outputs kept in BF16.
# Last QKV (layer 29), late adaLN (22-29), and refiner outputs in BF16.
# All main-trunk w1 (gate) projections in MXFP8.
convert_to_quant -i "$1" \
--nvfp4 --zimage --comfy_quant --save-quant-metadata \
--custom-type mxfp8 \
--custom-layers "layers\.(10|16|26)\.attention\.qkv\.weight|layers\.(27|28)\.attention\.qkv\.weight|layers\.(0|1)\.attention\.out\.weight|layers\.(3|6|9|11|12|13|14|19|20|26)\.attention\.out\.weight|layers\.(27|28|29)\.attention\.out\.weight|layers\.(3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26)\.feed_forward\.w1\.weight|layers\.(27|28|29)\.feed_forward\.w1\.weight|layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.qkv\.weight|context_refiner\.(0|1)\.feed_forward\.w1\.weight|context_refiner\.(0|1)\.feed_forward\.w2\.weight|context_refiner\.(0|1)\.feed_forward\.w3\.weight|noise_refiner\.(0)\.attention\.(qkv|out)\.weight|noise_refiner\.(0)\.feed_forward\.(w1|w2)\.weight|noise_refiner\.(1)\.feed_forward\.w1\.weight" \
--exclude-layers "layers\.(29)\.attention\.qkv\.weight|layers\.(22|23|24|25|26)\.adaLN_modulation\.0\.weight|layers\.(27|28|29)\.adaLN_modulation\.0\.weight|context_refiner\.(0|1)\.attention\.out\.weight|context_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(1)\.attention\.qkv\.weight|noise_refiner\.(1)\.attention\.out\.weight|noise_refiner\.(1)\.feed_forward\.w2\.weight|noise_refiner\.(0|1)\.feed_forward\.w3\.weight" \
--num-iter 6000 --top-p 0.35 --calib-samples 8192 \
--scale-optimization iterative --scale-refinement-rounds 2 \
--extract-lora --lora-rank 32 \
-o "${1%%.safetensors}-nvfp4.safetensors"
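To check which format a given tensor name ends up in, the patterns from the command above can be replayed in Python. This is a sketch under two assumptions that this card does not confirm: that patterns are applied with search semantics, and that --exclude-layers takes precedence over --custom-layers (which matches the tables above, where context_refiner.1.w2 is listed as BF16 although it also matches the MXFP8 pattern). The regexes are condensed but equivalent to the ones passed on the command line. The `classify` helper is hypothetical, and it does not model the tensors that --zimage excludes automatically (norms, biases, embeddings).

```python
import re

# Condensed from the --custom-layers argument (MXFP8 tensors).
MXFP8_PATTERN = re.compile("|".join([
    r"layers\.(10|16|26|27|28)\.attention\.qkv\.weight",
    r"layers\.(0|1|3|6|9|11|12|13|14|19|20|26|27|28|29)\.attention\.out\.weight",
    r"layers\.([3-9]|1[0-9]|2[0-9])\.feed_forward\.w1\.weight",
    r"layers\.(16|17|18|19|20|21)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.qkv\.weight",
    r"context_refiner\.(0|1)\.feed_forward\.(w1|w2|w3)\.weight",
    r"noise_refiner\.0\.attention\.(qkv|out)\.weight",
    r"noise_refiner\.0\.feed_forward\.(w1|w2)\.weight",
    r"noise_refiner\.1\.feed_forward\.w1\.weight",
]))

# Condensed from the --exclude-layers argument (tensors kept in BF16).
BF16_PATTERN = re.compile("|".join([
    r"layers\.29\.attention\.qkv\.weight",
    r"layers\.(22|23|24|25|26|27|28|29)\.adaLN_modulation\.0\.weight",
    r"context_refiner\.(0|1)\.attention\.out\.weight",
    r"context_refiner\.1\.feed_forward\.w2\.weight",
    r"noise_refiner\.1\.attention\.(qkv|out)\.weight",
    r"noise_refiner\.1\.feed_forward\.w2\.weight",
    r"noise_refiner\.(0|1)\.feed_forward\.w3\.weight",
]))


def classify(name):
    """Return the assumed storage format for a quantizable weight tensor name."""
    if BF16_PATTERN.search(name):   # assumption: exclusions win over custom types
        return "bf16"
    if MXFP8_PATTERN.search(name):
        return "mxfp8"
    return "nvfp4"                  # baseline selected by --nvfp4


for name in ("layers.29.attention.qkv.weight",
             "layers.10.attention.qkv.weight",
             "layers.5.attention.qkv.weight"):
    print(name, "->", classify(name))
```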
Included files
| File | Description |
|---|---|
| z_image_turbo_nvfp4.safetensors | Quantized weights |
| z_image_turbo_nvfp4_lora.safetensors | Error-correction LoRA (rank 32) |
Use the LoRA with variable strength in ComfyUI for improved fidelity.
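Conceptually, the strength setting scales the low-rank correction before it is added to the quantized weights. The toy sketch below illustrates that idea in plain Python. It assumes the LoRA is a plain up/down matrix pair with no additional alpha scaling, and the helper names (`matmul`, `apply_lora`) are hypothetical; it is not how ComfyUI or comfy-kitchen implement LoRA loading.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(w_quant, up, down, strength):
    """Return w_quant + strength * (up @ down), element by element."""
    delta = matmul(up, down)
    return [[w + strength * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_quant, delta)]


# Toy 2x2 example: original weights, a coarse "quantized" copy, and a rank-1
# LoRA (up: 2x1, down: 1x2) whose product equals the quantization residual.
w_original = [[1.25, 0.5], [2.5, 1.0]]
w_quant = [[1.0, 0.0], [2.0, 0.0]]
up = [[0.5], [1.0]]
down = [[0.5, 1.0]]

for strength in (0.0, 0.5, 1.0):
    print(strength, apply_lora(w_quant, up, down, strength))
```

In this toy case strength 0.0 leaves the quantized weights unchanged and strength 1.0 restores the original weights exactly. A real rank-32 LoRA only approximates the residual, so the best strength is worth tuning by eye.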
Requirements
- Inference: CUDA 13.0+, PyTorch 2.10+, comfy-kitchen, Blackwell GPU (RTX 50xx / B100 / B200)
- Generation: convert_to_quant >= 1.2.6, comfy-kitchen
Comparison
| | NVFP4 Mixed (this) | MXFP8 Uniform | Official NVFP4 |
|---|---|---|---|
| Size | 4.84 GB | 6.23 GB | 4.51 GB |
| Base format | NVFP4 (4-bit) | MXFP8 (8-bit) | NVFP4 (4-bit) |
| Custom layers | ~100 tensors → MXFP8 | None | None |
| BF16 exclusions | ~20 tensors | 8 patterns | Refiners fully BF16 |
| Learned rounding | ✅ 6000 iter | ❌ --simple | ❌ |
| LoRA | ✅ rank 32 | ❌ | ❌ |
| Refiner block 0 | MXFP8 | MXFP8 | BF16 |
| Late adaLN (22–29) | BF16 | BF16 | NVFP4 ⚠️ |
| Last QKV (layer 29) | BF16 | BF16 | NVFP4 ⚠️ |
| Quantization time¹ | ~60–90 min | ~5–10 min | N/A |
¹ Estimated on RTX 5060 (Blackwell) with comfy-kitchen CUDA kernels.
Methodology
Layer sensitivity was analyzed using quant_probe, which computes per-tensor excess kurtosis, dynamic range, and aspect ratio, then scores them against the model's own distribution to recommend *KEEP*, FP8, or NVFP4.
Recommendations were cross-referenced against the DiT quantization literature:
- PTQ4DiT (NeurIPS 2024): salient channels in QKV + FFN, last blocks most affected
- ViDiT-Q (ICLR 2025): metric-decoupled sensitivity; self-attention dominates visual quality
- HTG (2025): channel-dependent outliers, severe in later blocks
- SemanticDialect (2026): block-wise mixed-format validated for video DiTs
- SVDQuant (ICLR 2025): low-rank branch absorbs 4-bit error, validated NVFP4
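The three per-tensor statistics named above can be sketched with textbook definitions. The exact formulas, thresholds, and scoring used by quant_probe are not given in this card, so the functions below (`excess_kurtosis`, `dynamic_range_bits`, `aspect_ratio`) are illustrative assumptions and not the tool's implementation.

```python
import math


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; large values indicate heavy tails / outliers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


def dynamic_range_bits(values):
    """log2 of the largest magnitude over the smallest non-zero magnitude (one possible definition)."""
    mags = [abs(v) for v in values if v != 0]
    return math.log2(max(mags) / min(mags))


def aspect_ratio(rows, cols):
    """Ratio of the longer to the shorter dimension of a 2-D weight matrix."""
    return max(rows, cols) / min(rows, cols)


flat = [-1.0, 1.0] * 50                  # no outliers
spiky = [0.0] * 98 + [10.0, -10.0]       # two large outliers
print(excess_kurtosis(flat), excess_kurtosis(spiky))
```

A tensor with high excess kurtosis or a wide dynamic range has a few values that dominate the block scale, which is the usual reason a 4-bit format loses precision on it and a candidate is promoted to FP8 or kept in BF16.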
Credits
- Quantization engine: convert_to_quant by silveroxides
- Z-Image Turbo model by Tongyi-MAI
- ComfyUI integration via comfy-kitchen
- Layer sensitivity analysis via quant_probe
Model tree for InsecureErasure/Z-Image-Turbo-NVFP4
Base model
Tongyi-MAI/Z-Image-Turbo
