---
license: other
license_name: nscl-v1
license_link: LICENSE
tags:
- image-generation
- class-conditional
- diffusion
- pixel-space
- dit
- imagenet
library_name: pytorch
pipeline_tag: unconditional-image-generation
---

<p align="center">
  <img src="https://raw.githubusercontent.com/pixeldit/pixeldit.github.io/main/static/images/pixeldit-logo.png" height="120" />
</p>

<h2 align="center">PixelDiT: Pixel Diffusion Transformers for Image Generation</h2>

<p align="center">
  <a href="https://www.yongshengyu.com/">Yongsheng Yu</a><sup>1,2</sup>
  <a href="https://wxiong.me/">Wei Xiong</a><sup>1†</sup>
  <a href="https://weilinie.github.io/">Weili Nie</a><sup>1</sup>
  <a href="https://shengcn.github.io/">Yichen Sheng</a><sup>1</sup>
  <a href="http://behindthepixels.io/">Shiqiu Liu</a><sup>1</sup>
  <a href="https://www.cs.rochester.edu/u/jluo/">Jiebo Luo</a><sup>2</sup>
</p>
<p align="center">
  <sup>1</sup>NVIDIA <sup>2</sup>University of Rochester
  <br>
  <sup>†</sup>Project Lead and Main Advising
</p>

<p align="center">
  <a href="https://pixeldit.github.io/"><img src="https://img.shields.io/badge/Website-Project_Page-2ea44f" /></a>
  <a href="https://arxiv.org/abs/2511.20645"><img src="https://img.shields.io/badge/arXiv-2511.20645-b31b1b.svg" /></a>
  <a href="https://github.com/NVlabs/PixelDiT"><img src="https://img.shields.io/badge/GitHub-Code-blue" /></a>
</p>

+
## Model Overview
|
| 45 |
+
|
| 46 |
+
**PixelDiT-XL** (797M parameters) is a class-conditional image generation model trained on ImageNet, operating directly in **pixel space** — no VAE, no latent space. It uses a dual-level architecture combining a patch-level DiT for global semantics with a pixel-level DiT for fine texture details.
|
| 47 |
+
|
| 48 |
+
## Pre-trained Checkpoints
|
| 49 |
+
|
| 50 |
+
| Checkpoint | Resolution | Epochs | gFID | CFG Scale | Time Shift | CFG Interval |
|
| 51 |
+
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
|
| 52 |
+
| `imagenet256_pixeldit_xl_epoch80.ckpt` | 256x256 | 80 | **2.36** | 3.25 | 1.0 | [0.1, 1.0] |
|
| 53 |
+
| `imagenet256_pixeldit_xl_epoch160.ckpt` | 256x256 | 160 | **1.97** | 3.25 | 1.0 | [0.1, 1.0] |
|
| 54 |
+
| `imagenet256_pixeldit_xl_epoch320.ckpt` | 256x256 | 320 | **1.61** | 2.75 | 1.0 | [0.1, 0.9] |
|
| 55 |
+
| `imagenet512_pixeldit_xl.ckpt` | 512x512 | 850 | **1.78** | 3.5 | 2.0 | [0.1, 1.0] |
|
| 56 |
+
|
| 57 |
+
All evaluations use **FlowDPMSolver** with **100 steps**. 50K samples. Metrics follow the ADM evaluation protocol.
|
| 58 |
+
|
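In the table above, the CFG scale is applied only while the sampling time falls inside the listed CFG interval. As a rough illustration (not the repository's actual API — the function name and gating convention here are assumptions), interval-gated classifier-free guidance can be sketched as:

```python
def guided_velocity(v_cond, v_uncond, t, guidance, t_min=0.1, t_max=1.0):
    """Blend conditional and unconditional predictions with classifier-free
    guidance, but only while t lies inside [t_min, t_max]; outside that
    interval the plain conditional prediction is returned unchanged."""
    if t_min <= t <= t_max:
        return v_uncond + guidance * (v_cond - v_uncond)
    return v_cond
```

For example, with the epoch-320 settings (scale 2.75, interval [0.1, 0.9]), guidance is skipped near both endpoints of the trajectory.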
## Usage

### Installation

```bash
pip install torch torchvision lightning omegaconf timm wandb h5py
```

### Evaluation (Generate 50K Samples)

```bash
cd c2i/

# ImageNet 256x256 (epoch 320, best FID)
torchrun --nproc_per_node=8 main.py predict \
  -c configs/pix256_xl.yaml \
  --ckpt_path=imagenet256_pixeldit_xl_epoch320.ckpt \
  --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
  --model.diffusion_sampler.init_args.num_steps=100 \
  --model.diffusion_sampler.init_args.guidance=2.75 \
  --model.diffusion_sampler.init_args.timeshift=1.0 \
  --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
  --model.diffusion_sampler.init_args.guidance_interval_max=0.9 \
  --per_run_seed=false --seed_everything=1000

# ImageNet 512x512
torchrun --nproc_per_node=8 main.py predict \
  -c configs/pix512_xl.yaml \
  --ckpt_path=imagenet512_pixeldit_xl.ckpt \
  --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
  --model.diffusion_sampler.init_args.num_steps=100 \
  --model.diffusion_sampler.init_args.guidance=3.5 \
  --model.diffusion_sampler.init_args.timeshift=2.0 \
  --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
  --model.diffusion_sampler.init_args.guidance_interval_max=1.0 \
  --per_run_seed=false --seed_everything=10000
```

After generating samples, compute FID with the [ADM evaluation toolkit](https://github.com/openai/guided-diffusion/tree/main/evaluations).

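The `timeshift` flag warps the uniform sampling schedule toward the high-noise end, which matters more at 512x512 (shift 2.0) than at 256x256 (shift 1.0). The repository's exact mapping is not shown here; a common formulation in flow-matching samplers, assumed purely for illustration, is t' = s·t / (1 + (s − 1)·t):

```python
def shift_time(t: float, shift: float) -> float:
    """Warp a timestep t in [0, 1] by shift factor s via
    t' = s*t / (1 + (s-1)*t). shift=1.0 is the identity; shift>1.0
    spends more of the schedule at high noise levels."""
    return shift * t / (1.0 + (shift - 1.0) * t)
```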
## Model Architecture

| Component | Value |
|-----------|-------|
| Parameters | 797M |
| Input channels | 3 (RGB) |
| Patch size | 16 |
| Hidden size | 1152 |
| Attention heads | 16 |
| Patch-level depth | 26 |
| Pixel-level depth | 4 |
| Pixel hidden size | 16 |
| Classes | 1000 (ImageNet) |

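The table above can be read as a single configuration object for the dual-level design. The field names below are illustrative only — they are not the actual keys used in `configs/pix256_xl.yaml`:

```python
from dataclasses import dataclass

@dataclass
class PixelDiTXLConfig:
    # Patch-level DiT: global semantics over 16x16 pixel patches
    in_channels: int = 3        # RGB input, no VAE latents
    patch_size: int = 16
    hidden_size: int = 1152
    num_heads: int = 16
    patch_depth: int = 26
    # Pixel-level DiT: fine texture refinement
    pixel_depth: int = 4
    pixel_hidden_size: int = 16
    num_classes: int = 1000     # ImageNet class conditioning
```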
## Citation

```bibtex
@misc{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  year={2025},
  eprint={2511.20645},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.20645},
}
```

## License

This model is released under the [NVIDIA OneWay Non-Commercial License](LICENSE). The model and any derivative works may be used only for non-commercial (research or evaluation) purposes.