---
license: other
license_name: nsclv1
license_link: LICENSE
tags:
- image-generation
- diffusion
- vision-transformer
- class-conditional
datasets:
- imagenet-1k
---

# DiffiT: Diffusion Vision Transformers for Image Generation

[![Paper](https://img.shields.io/badge/arXiv-2312.02139-b31b1b.svg)](https://arxiv.org/abs/2312.02139)
[![GitHub](https://img.shields.io/github/stars/NVlabs/DiffiT.svg?style=social)](https://github.com/NVlabs/DiffiT)

This repository hosts the pretrained model weights for [**DiffiT**](https://arxiv.org/abs/2312.02139) (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.

## Overview

**DiffiT** (Diffusion Vision Transformers) combines the expressive power of diffusion models with Vision Transformers (ViTs), introducing **Time-dependent Multihead Self-Attention (TMSA)** for fine-grained control over the denoising process at each diffusion timestep. DiffiT achieves state-of-the-art class-conditional generation on ImageNet at multiple resolutions, notably attaining an **FID score of 1.73** on ImageNet-256 while using **19.85% and 16.88% fewer parameters** than comparable Transformer-based diffusion models such as MDT and DiT, respectively.
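
The idea behind TMSA is that the attention projections become functions of the diffusion timestep: queries, keys, and values each sum a spatial-token projection and a time-embedding projection, so the attention pattern itself changes as denoising progresses. A minimal single-head sketch in NumPy (shapes, names, and initialization are illustrative assumptions, not the repository's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 32                          # spatial tokens, embedding dim

def tmsa(x, t_emb, Wq, Wk, Wv, Wqt, Wkt, Wvt):
    """Time-dependent self-attention (single head, output projection omitted).
    q, k, v each combine a spatial projection with a time projection,
    so the attention weights vary with the diffusion timestep."""
    q = x @ Wq + t_emb @ Wqt           # (n, d): time embedding shifts every query
    k = x @ Wk + t_emb @ Wkt
    v = x @ Wv + t_emb @ Wvt
    scores = q @ k.T / np.sqrt(d)      # (n, n) scaled dot-product scores
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = scores / scores.sum(axis=-1, keepdims=True)  # row-wise softmax
    return attn @ v                    # (n, d)

x = rng.normal(size=(n, d))            # spatial tokens
t_emb = rng.normal(size=(d,))          # time embedding, broadcast over tokens
Ws = [rng.normal(size=(d, d)) * 0.1 for _ in range(6)]
out = tmsa(x, t_emb, *Ws)
```

Changing `t_emb` changes the output even for identical spatial tokens, which is the property TMSA exploits to adapt denoising behavior per timestep.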

![imagenet](https://cdn-uploads.huggingface.co/production/uploads/64414b62603214724ebd2636/5Pbe6fTZAV5eAwH6eokdh.png)

![latent_diffit](https://cdn-uploads.huggingface.co/production/uploads/64414b62603214724ebd2636/2hPFK3g2uHfDR1bzhYJyJ.png)

## Models

### ImageNet-256

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 256×256 | **1.73** | **276.49** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_256.safetensors) |

### ImageNet-512

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 512×512 | **2.67** | **252.12** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_512.safetensors) |
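
The checkpoints follow the Hub's standard `resolve` URL pattern, so they can be fetched with any HTTP client. A small sketch that reconstructs the links above (the repo id `nvidia/DiffiT` is taken from those links):

```python
def checkpoint_url(repo_id: str, filename: str, revision: str = "main") -> str:
    # Hugging Face Hub "resolve" URL pattern, as used in the tables above.
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

url_256 = checkpoint_url("nvidia/DiffiT", "diffit_256.safetensors")
url_512 = checkpoint_url("nvidia/DiffiT", "diffit_512.safetensors")
```

With `huggingface_hub` installed, `hf_hub_download("nvidia/DiffiT", "diffit_256.safetensors")` downloads and caches the same file.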

## Usage

Please refer to the official [GitHub repository](https://github.com/NVlabs/DiffiT) for full setup instructions, training code, and evaluation scripts.

### Sampling Images

Image sampling is performed with `sample.py` from the [DiffiT repository](https://github.com/NVlabs/DiffiT). To reproduce the reported numbers, run the commands below, where `$MODEL` points to the downloaded checkpoint and `$LOG_DIR` to the directory that will receive the samples.

**ImageNet-256:**

```bash
python sample.py \
  --log_dir $LOG_DIR \
  --cfg_scale 4.4 \
  --model_path $MODEL \
  --image_size 256 \
  --model Diffit \
  --num_sampling_steps 250 \
  --num_samples 50000 \
  --cfg_cond True
```

**ImageNet-512:**

```bash
python sample.py \
  --log_dir $LOG_DIR \
  --cfg_scale 1.49 \
  --model_path $MODEL \
  --image_size 512 \
  --model Diffit \
  --num_sampling_steps 250 \
  --num_samples 50000 \
  --cfg_cond True
```
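
The `--cfg_scale` flag controls classifier-free guidance, which mixes the conditional and unconditional denoiser predictions at every sampling step. A minimal sketch of the standard formulation (conventions vary between codebases, and the exact combination used by `sample.py` may differ):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by `scale`.
    scale = 0 -> unconditional, scale = 1 -> conditional."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.zeros(4)                 # toy unconditional prediction
eps_c = np.ones(4)                  # toy class-conditional prediction
out = cfg_combine(eps_u, eps_c, 4.4)  # guidance scale used at 256x256
```

Larger scales trade sample diversity for fidelity to the class condition, which is why the two resolutions above use different values.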

### Evaluation

Once images have been sampled, you can compute FID and other metrics with the provided `eval_run.sh` script in the repository. The evaluation pipeline follows the protocol from [openai/guided-diffusion/evaluations](https://github.com/openai/guided-diffusion/tree/main/evaluations).

```bash
bash eval_run.sh
```
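
FID, the headline metric in the tables above, is the Fréchet distance between Gaussian fits of Inception features from real and generated images. A NumPy sketch of the statistic itself, simplified to diagonal covariances for brevity (the reference evaluator uses full covariance matrices and a matrix square root; feature extraction is omitted):

```python
import numpy as np

def fid_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2))."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    return float(((mu1 - mu2) ** 2).sum()
                 + (var1 + var2 - 2.0 * np.sqrt(var1 * var2)).sum())

# Identical statistics give a distance of zero.
d0 = fid_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
```

Lower is better: the closer the generated feature distribution is to the real one, the smaller the distance.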

## Citation

```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={DiffiT: Diffusion Vision Transformers for Image Generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```

## License

Copyright © 2026, NVIDIA Corporation. All rights reserved.

The code is released under the [NVIDIA Source Code License-NC](https://github.com/NVlabs/DiffiT/blob/main/LICENSE). The pretrained models are shared under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.