---
license: other
license_name: nscl-v1
license_link: LICENSE
tags:
- image-generation
- class-conditional
- diffusion
- pixel-space
- dit
- imagenet
library_name: pytorch
pipeline_tag: unconditional-image-generation
---

<p align="center">
  <img src="https://raw.githubusercontent.com/pixeldit/pixeldit.github.io/main/static/images/pixeldit-logo.png" height="120" />
</p>

<h2 align="center">PixelDiT: Pixel Diffusion Transformers for Image Generation</h2>

<p align="center">
  <a href="https://www.yongshengyu.com/">Yongsheng Yu</a><sup>1,2</sup>
  <a href="https://wxiong.me/">Wei Xiong</a><sup>1†</sup>
  <a href="https://weilinie.github.io/">Weili Nie</a><sup>1</sup>
  <a href="https://shengcn.github.io/">Yichen Sheng</a><sup>1</sup>
  <a href="http://behindthepixels.io/">Shiqiu Liu</a><sup>1</sup>
  <a href="https://www.cs.rochester.edu/u/jluo/">Jiebo Luo</a><sup>2</sup>
</p>
<p align="center">
  <sup>1</sup>NVIDIA <sup>2</sup>University of Rochester
  <br>
  <sup>†</sup>Project Lead and Main Advising
</p>

<p align="center">
  <a href="https://pixeldit.github.io/"><img src="https://img.shields.io/badge/Website-Project_Page-2ea44f" /></a>
  <a href="https://arxiv.org/abs/2511.20645"><img src="https://img.shields.io/badge/arXiv-2511.20645-b31b1b.svg" /></a>
  <a href="https://github.com/NVlabs/PixelDiT"><img src="https://img.shields.io/badge/GitHub-Code-blue" /></a>
</p>

+
## Model Overview
|
| 45 |
+
|
| 46 |
+
**PixelDiT-XL** (797M parameters) is a class-conditional image generation model trained on ImageNet, operating directly in **pixel space** — no VAE, no latent space. It uses a dual-level architecture combining a patch-level DiT for global semantics with a pixel-level DiT for fine texture details.
|
| 47 |
+
|
| 48 |
+
## Pre-trained Checkpoints
|
| 49 |
+
|
| 50 |
+
| Checkpoint | Resolution | Epochs | gFID | CFG Scale | Time Shift | CFG Interval |
|
| 51 |
+
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
|
| 52 |
+
| `imagenet256_pixeldit_xl_epoch80.ckpt` | 256x256 | 80 | **2.36** | 3.25 | 1.0 | [0.1, 1.0] |
|
| 53 |
+
| `imagenet256_pixeldit_xl_epoch160.ckpt` | 256x256 | 160 | **1.97** | 3.25 | 1.0 | [0.1, 1.0] |
|
| 54 |
+
| `imagenet256_pixeldit_xl_epoch320.ckpt` | 256x256 | 320 | **1.61** | 2.75 | 1.0 | [0.1, 0.9] |
|
| 55 |
+
| `imagenet512_pixeldit_xl.ckpt` | 512x512 | 850 | **1.78** | 3.5 | 2.0 | [0.1, 1.0] |
|
| 56 |
+
|
| 57 |
+
All evaluations use **FlowDPMSolver** with **100 steps**. 50K samples. Metrics follow the ADM evaluation protocol.
|
| 58 |
+
|
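In the table above, the CFG scale is applied only while the sampling time falls inside the listed CFG interval. As a rough illustration (not the repository's actual API — the function name and gating convention here are assumptions), interval-gated classifier-free guidance can be sketched as:

```python
def guided_velocity(v_cond, v_uncond, t, guidance, t_min=0.1, t_max=1.0):
    """Blend conditional and unconditional predictions with classifier-free
    guidance, but only while t lies inside [t_min, t_max]; outside that
    interval the plain conditional prediction is returned unchanged."""
    if t_min <= t <= t_max:
        return v_uncond + guidance * (v_cond - v_uncond)
    return v_cond
```

For example, with the epoch-320 settings (scale 2.75, interval [0.1, 0.9]), guidance is skipped near both endpoints of the trajectory.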
## Usage

### Installation

```bash
pip install torch torchvision lightning omegaconf timm wandb h5py
```

### Evaluation (Generate 50K Samples)

```bash
cd c2i/

# ImageNet 256x256 (epoch 320, best FID)
torchrun --nproc_per_node=8 main.py predict \
  -c configs/pix256_xl.yaml \
  --ckpt_path=imagenet256_pixeldit_xl_epoch320.ckpt \
  --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
  --model.diffusion_sampler.init_args.num_steps=100 \
  --model.diffusion_sampler.init_args.guidance=2.75 \
  --model.diffusion_sampler.init_args.timeshift=1.0 \
  --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
  --model.diffusion_sampler.init_args.guidance_interval_max=0.9 \
  --per_run_seed=false --seed_everything=1000

# ImageNet 512x512
torchrun --nproc_per_node=8 main.py predict \
  -c configs/pix512_xl.yaml \
  --ckpt_path=imagenet512_pixeldit_xl.ckpt \
  --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
  --model.diffusion_sampler.init_args.num_steps=100 \
  --model.diffusion_sampler.init_args.guidance=3.5 \
  --model.diffusion_sampler.init_args.timeshift=2.0 \
  --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
  --model.diffusion_sampler.init_args.guidance_interval_max=1.0 \
  --per_run_seed=false --seed_everything=10000
```

After generating samples, compute FID with the [ADM evaluation toolkit](https://github.com/openai/guided-diffusion/tree/main/evaluations).

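The `timeshift` flag warps the uniform sampling schedule toward the high-noise end, which matters more at 512x512 (shift 2.0) than at 256x256 (shift 1.0). The repository's exact mapping is not shown here; a common formulation in flow-matching samplers, assumed purely for illustration, is t' = s·t / (1 + (s − 1)·t):

```python
def shift_time(t: float, shift: float) -> float:
    """Warp a timestep t in [0, 1] by shift factor s via
    t' = s*t / (1 + (s-1)*t). shift=1.0 is the identity; shift>1.0
    spends more of the schedule at high noise levels."""
    return shift * t / (1.0 + (shift - 1.0) * t)
```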
## Model Architecture

| Component | Value |
|-----------|-------|
| Parameters | 797M |
| Input channels | 3 (RGB) |
| Patch size | 16 |
| Hidden size | 1152 |
| Attention heads | 16 |
| Patch-level depth | 26 |
| Pixel-level depth | 4 |
| Pixel hidden size | 16 |
| Classes | 1000 (ImageNet) |

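The table above can be read as a single configuration object for the dual-level design. The field names below are illustrative only — they are not the actual keys used in `configs/pix256_xl.yaml`:

```python
from dataclasses import dataclass

@dataclass
class PixelDiTXLConfig:
    # Patch-level DiT: global semantics over 16x16 pixel patches
    in_channels: int = 3        # RGB input, no VAE latents
    patch_size: int = 16
    hidden_size: int = 1152
    num_heads: int = 16
    patch_depth: int = 26
    # Pixel-level DiT: fine texture refinement
    pixel_depth: int = 4
    pixel_hidden_size: int = 16
    num_classes: int = 1000     # ImageNet class conditioning
```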
## Citation

```bibtex
@misc{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  year={2025},
  eprint={2511.20645},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.20645},
}
```

## License

This model is released under the [NVIDIA OneWay Non-Commercial License](LICENSE). The model and any derivative works may be used only for non-commercial (research or evaluation) purposes.