yongshengy committed 08f246c (verified; parent: 517491a)

Upload README.md with huggingface_hub

Files changed (1): README.md (+129 -5)
---
license: other
license_name: nscl-v1
license_link: LICENSE
tags:
- image-generation
- class-conditional
- diffusion
- pixel-space
- dit
- imagenet
library_name: pytorch
pipeline_tag: unconditional-image-generation
---

<p align="center">
  <img src="https://raw.githubusercontent.com/pixeldit/pixeldit.github.io/main/static/images/pixeldit-logo.png" height="120" />
</p>

<h2 align="center">PixelDiT: Pixel Diffusion Transformers for Image Generation</h2>

<p align="center">
  <a href="https://www.yongshengyu.com/">Yongsheng Yu</a><sup>1,2</sup> &nbsp;
  <a href="https://wxiong.me/">Wei Xiong</a><sup>1†</sup> &nbsp;
  <a href="https://weilinie.github.io/">Weili Nie</a><sup>1</sup> &nbsp;
  <a href="https://shengcn.github.io/">Yichen Sheng</a><sup>1</sup> &nbsp;
  <a href="http://behindthepixels.io/">Shiqiu Liu</a><sup>1</sup> &nbsp;
  <a href="https://www.cs.rochester.edu/u/jluo/">Jiebo Luo</a><sup>2</sup>
</p>
<p align="center">
  <sup>1</sup>NVIDIA &nbsp; <sup>2</sup>University of Rochester
  <br>
  <sup>†</sup>Project Lead and Main Advising
</p>

<p align="center">
  <a href="https://pixeldit.github.io/"><img src="https://img.shields.io/badge/Website-Project_Page-2ea44f" /></a>
  &nbsp;
  <a href="https://arxiv.org/abs/2511.20645"><img src="https://img.shields.io/badge/arXiv-2511.20645-b31b1b.svg" /></a>
  &nbsp;
  <a href="https://github.com/NVlabs/PixelDiT"><img src="https://img.shields.io/badge/GitHub-Code-blue" /></a>
</p>

## Model Overview

**PixelDiT-XL** (797M parameters) is a class-conditional image generation model trained on ImageNet, operating directly in **pixel space**: no VAE, no latent space. It uses a dual-level architecture combining a patch-level DiT for global semantics with a pixel-level DiT for fine texture details.
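To make the dual-level split concrete, here is a small back-of-the-envelope sketch (illustrative arithmetic only, not the repository's code): with patch size 16, the patch-level DiT attends over (H/16)·(W/16) tokens, while the pixel-level DiT refines the 16×16×3 = 768 raw values inside each patch.

```python
# Token/dimension arithmetic for the dual-level design (illustrative only).

def patch_tokens(resolution: int, patch_size: int = 16) -> int:
    """Number of patch-level tokens the global DiT attends over."""
    assert resolution % patch_size == 0
    return (resolution // patch_size) ** 2

def pixels_per_patch(patch_size: int = 16, channels: int = 3) -> int:
    """Raw pixel values the pixel-level DiT refines within one patch."""
    return patch_size * patch_size * channels

print(patch_tokens(256))   # 256 tokens at 256x256
print(patch_tokens(512))   # 1024 tokens at 512x512
print(pixels_per_patch())  # 768 values per 16x16 RGB patch
```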

## Pre-trained Checkpoints

| Checkpoint | Resolution | Epochs | gFID | CFG Scale | Time Shift | CFG Interval |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| `imagenet256_pixeldit_xl_epoch80.ckpt` | 256×256 | 80 | **2.36** | 3.25 | 1.0 | [0.1, 1.0] |
| `imagenet256_pixeldit_xl_epoch160.ckpt` | 256×256 | 160 | **1.97** | 3.25 | 1.0 | [0.1, 1.0] |
| `imagenet256_pixeldit_xl_epoch320.ckpt` | 256×256 | 320 | **1.61** | 2.75 | 1.0 | [0.1, 0.9] |
| `imagenet512_pixeldit_xl.ckpt` | 512×512 | 850 | **1.78** | 3.5 | 2.0 | [0.1, 1.0] |

All evaluations generate 50K samples with **FlowDPMSolver** at **100 steps**; metrics follow the ADM evaluation protocol.
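The **Time Shift** column warps the sampling timesteps. A common parameterization in flow-matching samplers is t ↦ s·t / (1 + (s−1)·t); whether this repo's `src.diffusion` sampler uses exactly this form is an assumption, but it illustrates the effect: shift 1.0 is the identity, while shift 2.0 skews steps toward the high-noise end.

```python
def timeshift(t: float, shift: float) -> float:
    """Assumed flow-matching time warp; identity when shift == 1.0."""
    return shift * t / (1.0 + (shift - 1.0) * t)

print(timeshift(0.5, 1.0))            # 0.5 (shift 1.0 leaves timesteps unchanged)
print(round(timeshift(0.5, 2.0), 3))  # 0.667 (shift 2.0 skews toward t = 1)
```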

## Usage

### Installation

```bash
pip install torch torchvision lightning omegaconf timm wandb h5py
```
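The checkpoints listed above can be fetched with `huggingface_hub` once you know this model's Hub repository id. A hedged sketch follows: the repo id is a placeholder, and `checkpoint_name` is a hypothetical helper that maps the table rows to filenames.

```python
from typing import Optional

REPO_ID = "<org>/<repo>"  # placeholder: substitute this model's actual Hub id

def checkpoint_name(resolution: int, epoch: Optional[int] = None) -> str:
    """Map the checkpoint table to filenames (hypothetical helper)."""
    if resolution == 512:
        return "imagenet512_pixeldit_xl.ckpt"
    return f"imagenet256_pixeldit_xl_epoch{epoch}.ckpt"

def fetch_checkpoint(resolution: int, epoch: Optional[int] = None) -> str:
    """Download a checkpoint from the Hub and return its local path."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub
    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_name(resolution, epoch))

# Example (requires network access and the real repo id):
# path = fetch_checkpoint(256, epoch=320)
```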

### Evaluation (Generate 50K Samples)

```bash
cd c2i/

# ImageNet 256x256 (epoch 320, best FID)
torchrun --nproc_per_node=8 main.py predict \
  -c configs/pix256_xl.yaml \
  --ckpt_path=imagenet256_pixeldit_xl_epoch320.ckpt \
  --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
  --model.diffusion_sampler.init_args.num_steps=100 \
  --model.diffusion_sampler.init_args.guidance=2.75 \
  --model.diffusion_sampler.init_args.timeshift=1.0 \
  --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
  --model.diffusion_sampler.init_args.guidance_interval_max=0.9 \
  --per_run_seed=false --seed_everything=1000

# ImageNet 512x512
torchrun --nproc_per_node=8 main.py predict \
  -c configs/pix512_xl.yaml \
  --ckpt_path=imagenet512_pixeldit_xl.ckpt \
  --model.diffusion_sampler.class_path=src.diffusion.FlowDPMSolverSampler \
  --model.diffusion_sampler.init_args.num_steps=100 \
  --model.diffusion_sampler.init_args.guidance=3.5 \
  --model.diffusion_sampler.init_args.timeshift=2.0 \
  --model.diffusion_sampler.init_args.guidance_interval_min=0.1 \
  --model.diffusion_sampler.init_args.guidance_interval_max=1.0 \
  --per_run_seed=false --seed_everything=10000
```

After generating samples, compute FID with the [ADM evaluation toolkit](https://github.com/openai/guided-diffusion/tree/main/evaluations).
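The `guidance_interval_min`/`guidance_interval_max` flags above restrict classifier-free guidance to a sub-range of sampling time. A minimal sketch of that gating (assumed mechanics; the sampler in `src.diffusion` may differ in details):

```python
def guided_prediction(uncond: float, cond: float, t: float,
                      guidance: float, t_min: float, t_max: float) -> float:
    """Apply CFG only while t lies inside [t_min, t_max] (assumed gating)."""
    if t_min <= t <= t_max:
        return uncond + guidance * (cond - uncond)
    return cond  # outside the interval, fall back to the conditional output

print(guided_prediction(0.0, 1.0, t=0.5, guidance=2.75, t_min=0.1, t_max=0.9))   # 2.75
print(guided_prediction(0.0, 1.0, t=0.95, guidance=2.75, t_min=0.1, t_max=0.9))  # 1.0
```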

## Model Architecture

| Component | Value |
|-----------|-------|
| Parameters | 797M |
| Input channels | 3 (RGB) |
| Patch size | 16 |
| Hidden size | 1152 |
| Attention heads | 16 |
| Patch-level depth | 26 |
| Pixel-level depth | 4 |
| Pixel hidden size | 16 |
| Classes | 1000 (ImageNet) |

## Citation

```bibtex
@misc{yu2025pixeldit,
  title={PixelDiT: Pixel Diffusion Transformers for Image Generation},
  author={Yongsheng Yu and Wei Xiong and Weili Nie and Yichen Sheng and Shiqiu Liu and Jiebo Luo},
  year={2025},
  eprint={2511.20645},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2511.20645},
}
```

## License

This model is released under the [NVIDIA OneWay Non-Commercial License](LICENSE). The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.