---
library_name: pytorch
tags:
  - super-resolution
  - diffusion
  - pixel-diffusion-decoder
  - vae-decoder
pipeline_tag: image-to-image
base_model:
  - nvidia/PixelDiT-1300M-1024px
  - Tongyi-MAI/Z-Image
  - black-forest-labs/FLUX.1-dev
  - black-forest-labs/FLUX.2-dev
  - nyu-visionx/Scale-RAE-Qwen7B_DiT9.8B
---

PiD — Pixel Diffusion Decoder

PiD teaser

Paper, Project Page

Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren

PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion model, unifying decoding and upsampling into a single generative module. It denoises directly in high-resolution pixel space and produces a super-resolved image in one pass. This repository hosts the released decoder checkpoints, plus the encoder/decoder ("VAE") weights they depend on.

All `PiD_*` checkpoints in this repo are 4-step distilled. The other entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`, `scale_rae/`) are the encoder/decoder (VAE) weights of the backbones that PiD plugs into; they are not PiD checkpoints themselves.

License/Terms of Use

This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.

Deployment Geography:

Global

PiD checkpoints

Two variants are released for each diffusers-style backbone:

- `2k`: trained at 2048 px. Used as a 4× decoder (512 px LDM → 2048 px), or as an 8× decoder for the Scale-RAE backbone (256 px → 2048 px).
- `2kto4k`: trained with multi-resolution data bucketing from 2048 px to 3840 px and an SD3-style dynamic shift. Designed for decoding a 1024 px LDM to 4K (4096 px).

Both checkpoint variants support multiple aspect ratios.
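The resolution arithmetic behind the two variants can be sketched as follows. This is only an illustration of the numbers quoted above: `decoded_size` is a hypothetical helper, not part of the PiD codebase, and it assumes the output side length is simply the LDM side length multiplied by the SR factor.

```python
def decoded_size(ldm_width: int, ldm_height: int, sr_factor: int) -> tuple[int, int]:
    """Return the (width, height) in pixels after decoding with the given SR factor.

    Hypothetical helper: assumes output side = LDM side * SR factor.
    """
    if sr_factor not in (4, 8):
        # Only 4x and 8x checkpoints are listed in this repository.
        raise ValueError(f"unsupported SR factor: {sr_factor}")
    return ldm_width * sr_factor, ldm_height * sr_factor


print(decoded_size(512, 512, 4))    # (2048, 2048)  2k variant, 4x
print(decoded_size(256, 256, 8))    # (2048, 2048)  2k variant, Scale-RAE, 8x
print(decoded_size(1024, 1024, 4))  # (4096, 4096)  2kto4k variant, 4x
```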

| Path | Backbone (encoder side) | SR factor | Variant |
| --- | --- | --- | --- |
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 So400M + Scale-RAE ViT-XL (1152-ch) | 8× | 2k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2kto4k |

Z-Image shares Flux1's VAE, so its inference path reuses the flux checkpoints (both 2k and 2kto4k) — no separate zimage checkpoint is shipped.

Each directory contains a single file, `model_ema_bf16.pth`, which holds the EMA weights cast to bfloat16. This is the format the inference scripts load by default.
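The directory naming pattern in the table can be turned into a small lookup, sketched below. This is a hypothetical helper for illustration only: the authoritative mapping is the registry in the PiD codebase, and the backbone keys used here (`flux`, `flux2`, `sd3`, `dinov2`, `siglip`, `zimage`) are assumptions read off the directory names rather than the registry's actual identifiers.

```python
from pathlib import Path

# (backbone, ckpt_type) -> SR factor, copied from the checkpoint table.
# Key names are assumptions, not the identifiers used by the PiD codebase.
RELEASED = {
    ("flux", "2k"): 4, ("flux2", "2k"): 4, ("sd3", "2k"): 4,
    ("dinov2", "2k"): 4, ("siglip", "2k"): 8,
    ("flux", "2kto4k"): 4, ("flux2", "2kto4k"): 4, ("sd3", "2kto4k"): 4,
}

# Z-Image shares Flux1's VAE, so it reuses the flux checkpoints.
ALIASES = {"zimage": "flux"}


def pid_checkpoint_path(backbone: str, ckpt_type: str = "2k",
                        root: str = "checkpoints") -> Path:
    """Build the path to model_ema_bf16.pth for a released checkpoint."""
    backbone = ALIASES.get(backbone, backbone)
    key = (backbone, ckpt_type)
    if key not in RELEASED:
        raise ValueError(f"no released checkpoint for {key}")
    name = f"PiD_res{ckpt_type}_sr{RELEASED[key]}x_official_{backbone}_distill_4step"
    return Path(root) / name / "model_ema_bf16.pth"


print(pid_checkpoint_path("zimage", "2kto4k").as_posix())
# checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step/model_ema_bf16.pth
```

Unreleased combinations, such as a `2kto4k` checkpoint for the DINOv2 backbone, raise `ValueError` instead of returning a path that does not exist.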

VAE / encoder weights

These are the per-backbone encoder (and, where applicable, original decoder) weights that PiD pairs with. They're hosted here so a single download brings everything needed end-to-end.

| Path | Description |
| --- | --- |
| `checkpoints/ae.safetensors` | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE. |
| `checkpoints/sd3_vae/` | SD3 medium 16-ch VAE in diffusers format. |
| `checkpoints/rae/` | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
| `checkpoints/scale_rae/` | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |
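After downloading, a quick sanity check can confirm that the VAE entries listed above are present. The sketch below is a hypothetical helper, not part of the PiD codebase; it only checks that each top-level entry exists and does not validate the contents of the directories.

```python
from pathlib import Path

# Top-level VAE / encoder entries, copied from the table above.
EXPECTED_VAE_ENTRIES = [
    "ae.safetensors",
    "flux2_ae.safetensors",
    "sd3_vae",
    "rae",
    "scale_rae",
]


def missing_vae_entries(root: str = "checkpoints") -> list[str]:
    """Return the expected entries that are absent under `root`."""
    base = Path(root)
    return [name for name in EXPECTED_VAE_ENTRIES if not (base / name).exists()]


if __name__ == "__main__":
    missing = missing_vae_entries()
    print("all VAE entries present" if not missing else f"missing: {missing}")
```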

Usage

The decoder checkpoints are loaded by the inference scripts in the PiD codebase. The single source of truth for the (backbone, ckpt_type) → path mapping is `pid/_src/inference/checkpoint_registry.py`. Clone the code repository, download this snapshot into its root, and the demos pick the right file automatically:

# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"

# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
    --prompt "A photorealistic cat" \
    --ldm_inference_steps 28 --save_xt_steps 22 24 26 \
    --output_dir ./results/demo \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4

Pick the 2kto4k variant via --pid_ckpt_type 2kto4k when decoding at 4K.
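For a single backbone, the download can be narrowed from Python with `huggingface_hub.snapshot_download` and its `allow_patterns` argument. The sketch below is an assumption-laden illustration: the backbone keys and the pairing of each PiD checkpoint with a VAE entry are read off the tables in this README, and whether a given demo needs only these files has not been verified against the codebase.

```python
# Assumed pairing of each backbone key with its VAE entry (from the tables above).
VAE_ENTRY = {
    "flux": "ae.safetensors",
    "flux2": "flux2_ae.safetensors",
    "sd3": "sd3_vae/*",
    "dinov2": "rae/*",
    "siglip": "scale_rae/*",
}


def allow_patterns(backbone: str, ckpt_type: str = "2k") -> list[str]:
    """Glob patterns selecting one PiD checkpoint directory plus its VAE weights."""
    if backbone not in VAE_ENTRY:
        raise ValueError(f"unknown backbone: {backbone}")
    return [
        # sr* covers both the 4x and 8x directory names.
        f"checkpoints/PiD_res{ckpt_type}_sr*_official_{backbone}_distill_4step/*",
        f"checkpoints/{VAE_ENTRY[backbone]}",
    ]


def download_subset(backbone: str, ckpt_type: str = "2k", local_dir: str = ".") -> str:
    """Download only the files matched by allow_patterns() into local_dir."""
    # Third-party dependency, imported lazily: pip install huggingface_hub
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="nvidia/PiD",
        local_dir=local_dir,
        allow_patterns=allow_patterns(backbone, ckpt_type),
    )


if __name__ == "__main__":
    download_subset("flux", "2kto4k")
```

A combination that was not released (for example `dinov2` with `2kto4k`) matches no checkpoint directory, so only the VAE entry would be fetched.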

Citation

@article{lu2026pid,
    title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
    author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
    journal={arXiv preprint arXiv:2605.23902},
    year={2026}
}