---
license: other
license_name: nsclv1
license_link: LICENSE
tags:
- image-generation
- diffusion
- vision-transformer
- class-conditional
datasets:
- imagenet-1k
---

# DiffiT: Diffusion Vision Transformers for Image Generation

[arXiv:2312.02139](https://arxiv.org/abs/2312.02139)
[GitHub: NVlabs/DiffiT](https://github.com/NVlabs/DiffiT)

This repository hosts the pretrained model weights for [**DiffiT**](https://arxiv.org/abs/2312.02139) (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.

## Overview

**DiffiT** (Diffusion Vision Transformers) combines the expressive power of diffusion models with Vision Transformers (ViTs), introducing **Time-dependent Multihead Self-Attention (TMSA)** for fine-grained control over the denoising process at each diffusion timestep. DiffiT achieves state-of-the-art performance on class-conditional ImageNet generation at multiple resolutions, attaining an **FID score of 1.73** on ImageNet-256 while using **19.85% and 16.88% fewer parameters** than the comparable Transformer-based diffusion models MDT and DiT, respectively.

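The core TMSA idea can be sketched in a few lines: the time embedding enters the attention operation itself, so queries, keys, and values are all conditioned on the diffusion timestep. Below is a minimal single-head NumPy illustration of that idea; the function and weight names are ours for exposition, not the repository's API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tmsa(x, t_emb, params):
    """Single-head sketch of time-dependent self-attention: queries,
    keys, and values each depend on both the spatial tokens x (n, d)
    and a shared time embedding t_emb (d,)."""
    q = x @ params["Wq"] + t_emb @ params["Wqt"]  # time-conditioned queries
    k = x @ params["Wk"] + t_emb @ params["Wkt"]  # time-conditioned keys
    v = x @ params["Wv"] + t_emb @ params["Wvt"]  # time-conditioned values
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))          # (n, n) attention weights
    return attn @ v

# Toy shapes only; real models use many heads and larger dimensions.
rng = np.random.default_rng(0)
d = 8
params = {k: rng.standard_normal((d, d)) * 0.1
          for k in ("Wq", "Wk", "Wv", "Wqt", "Wkt", "Wvt")}
x = rng.standard_normal((4, d))       # 4 spatial tokens
t_emb = rng.standard_normal(d)        # one timestep embedding
out = tmsa(x, t_emb, params)
print(out.shape)  # (4, 8)
```

Because the time embedding shifts the projections themselves, the attention pattern changes as denoising progresses, rather than being fixed across timesteps.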
## Models

### ImageNet-256

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 256×256 | **1.73** | **276.49** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_256.safetensors) |

### ImageNet-512

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 512×512 | **2.67** | **252.12** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_512.safetensors) |

## Usage

Please refer to the official [GitHub repository](https://github.com/NVlabs/DiffiT) for full setup instructions, training code, and evaluation scripts.

### Sampling Images

Image sampling is performed with `sample.py` from the [DiffiT repository](https://github.com/NVlabs/DiffiT). To reproduce the reported numbers, use the commands below.

**ImageNet-256:**

```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 4.4 \
    --model_path $MODEL \
    --image_size 256 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```

**ImageNet-512:**

```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 1.49 \
    --model_path $MODEL \
    --image_size 512 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```

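The `--cfg_scale` flag sets the classifier-free guidance strength (4.4 at 256², 1.49 at 512²). For intuition, guidance linearly extrapolates from the unconditional prediction toward the class-conditional one; the sketch below shows the standard combination rule in NumPy (the function name is ours, not from `sample.py`).

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the class-conditional one. A scale of 1.0
    recovers the purely conditional prediction; larger scales trade
    sample diversity for class fidelity."""
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)

eps_u = np.zeros((2, 2))  # toy unconditional model output
eps_c = np.ones((2, 2))   # toy conditional model output
print(cfg_combine(eps_u, eps_c, 4.4)[0, 0])  # 4.4
```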
### Evaluation

Once images have been sampled, you can compute FID and other metrics with the provided `eval_run.sh` script in the repository. The evaluation pipeline follows the protocol from [openai/guided-diffusion/evaluations](https://github.com/openai/guided-diffusion/tree/main/evaluations).

```bash
bash eval_run.sh
```

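For intuition, FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. A self-contained NumPy sketch of the formula follows; helper names are ours, and actual evaluation should go through `eval_run.sh`.

```python
import numpy as np

def matrix_sqrt(a):
    """Square root via eigendecomposition; assumes `a` is
    diagonalizable with non-negative real eigenvalues."""
    vals, vecs = np.linalg.eig(a)
    return (vecs * np.sqrt(np.maximum(vals.real, 0.0))) @ np.linalg.inv(vecs)

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})."""
    diff = mu1 - mu2
    covmean = matrix_sqrt(sigma1 @ sigma2)
    return float((diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)).real)

# Identical distributions give 0; shifting the mean by 1 in each of
# two dimensions contributes ||diff||^2 = 2.
mu, sigma = np.zeros(2), np.eye(2)
print(fid(mu, sigma, mu + 1.0, sigma))  # 2.0
```

In practice `mu` and `sigma` are the mean and covariance of 2048-dimensional Inception-v3 features over 50K samples, which is what the reported FID-50K numbers use.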
## Citation

```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={{DiffiT}: Diffusion vision transformers for image generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```

## License

Copyright © 2026, NVIDIA Corporation. All rights reserved.

The code is released under the [NVIDIA Source Code License-NC](https://github.com/NVlabs/DiffiT/blob/main/LICENSE). The pretrained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.