---
license: other
license_name: nsclv1
license_link: LICENSE
tags:
- image-generation
- diffusion
- vision-transformer
- class-conditional
datasets:
- imagenet-1k
---

# DiffiT: Diffusion Vision Transformers for Image Generation

[arXiv:2312.02139](https://arxiv.org/abs/2312.02139)
[GitHub: NVlabs/DiffiT](https://github.com/NVlabs/DiffiT)

This repository hosts the pretrained model weights for [**DiffiT**](https://arxiv.org/abs/2312.02139) (ECCV 2024), a diffusion model built on Vision Transformers that achieves state-of-the-art image generation quality with improved parameter efficiency.

## Overview

**DiffiT** (Diffusion Vision Transformers) combines the expressive power of diffusion models with Vision Transformers (ViTs), introducing **Time-dependent Multihead Self-Attention (TMSA)** for fine-grained control over the denoising process at each diffusion timestep. DiffiT achieves state-of-the-art performance on class-conditional ImageNet generation at multiple resolutions, attaining an **FID score of 1.73** on ImageNet-256 while using **19.85% and 16.88% fewer parameters** than the comparable Transformer-based diffusion models MDT and DiT, respectively.

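The core TMSA idea can be sketched in a few lines: the time embedding enters the attention operation itself, so queries, keys, and values are all conditioned on the diffusion timestep. Below is a minimal single-head NumPy illustration of that idea; the function and weight names are ours for exposition, not the repository's API.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tmsa(x, t_emb, params):
    """Single-head sketch of time-dependent self-attention: queries,
    keys, and values each depend on both the spatial tokens x (n, d)
    and a shared time embedding t_emb (d,)."""
    q = x @ params["Wq"] + t_emb @ params["Wqt"]  # time-conditioned queries
    k = x @ params["Wk"] + t_emb @ params["Wkt"]  # time-conditioned keys
    v = x @ params["Wv"] + t_emb @ params["Wvt"]  # time-conditioned values
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))          # (n, n) attention weights
    return attn @ v

# Toy shapes only; real models use many heads and larger dimensions.
rng = np.random.default_rng(0)
d = 8
params = {k: rng.standard_normal((d, d)) * 0.1
          for k in ("Wq", "Wk", "Wv", "Wqt", "Wkt", "Wvt")}
x = rng.standard_normal((4, d))       # 4 spatial tokens
t_emb = rng.standard_normal(d)        # one timestep embedding
out = tmsa(x, t_emb, params)
print(out.shape)  # (4, 8)
```

Because the time embedding shifts the projections themselves, the attention pattern changes as denoising progresses, rather than being fixed across timesteps.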
## Models

### ImageNet-256

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 256×256 | **1.73** | **276.49** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_256.safetensors) |

### ImageNet-512

| Model | Dataset | Resolution | FID-50K | Inception Score | Download |
|-------|---------|------------|---------|-----------------|----------|
| **DiffiT** | ImageNet | 512×512 | **2.67** | **252.12** | [model](https://huggingface.co/nvidia/DiffiT/resolve/main/diffit_512.safetensors) |

## Usage

Please refer to the official [GitHub repository](https://github.com/NVlabs/DiffiT) for full setup instructions, training code, and evaluation scripts.

### Sampling Images

Image sampling is performed with `sample.py` from the [DiffiT repository](https://github.com/NVlabs/DiffiT). To reproduce the reported numbers, use the commands below.

**ImageNet-256:**

```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 4.4 \
    --model_path $MODEL \
    --image_size 256 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```

**ImageNet-512:**

```bash
python sample.py \
    --log_dir $LOG_DIR \
    --cfg_scale 1.49 \
    --model_path $MODEL \
    --image_size 512 \
    --model Diffit \
    --num_sampling_steps 250 \
    --num_samples 50000 \
    --cfg_cond True
```

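The `--cfg_scale` flag sets the classifier-free guidance strength (4.4 at 256², 1.49 at 512²). For intuition, guidance linearly extrapolates from the unconditional prediction toward the class-conditional one; the sketch below shows the standard combination rule in NumPy (the function name is ours, not from `sample.py`).

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the class-conditional one. A scale of 1.0
    recovers the purely conditional prediction; larger scales trade
    sample diversity for class fidelity."""
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)

eps_u = np.zeros((2, 2))  # toy unconditional model output
eps_c = np.ones((2, 2))   # toy conditional model output
print(cfg_combine(eps_u, eps_c, 4.4)[0, 0])  # 4.4
```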
### Evaluation

Once images have been sampled, you can compute FID and other metrics with the provided `eval_run.sh` script in the repository. The evaluation pipeline follows the protocol from [openai/guided-diffusion/evaluations](https://github.com/openai/guided-diffusion/tree/main/evaluations).

```bash
bash eval_run.sh
```

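For intuition, FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. A self-contained NumPy sketch of the formula follows; helper names are ours, and actual evaluation should go through `eval_run.sh`.

```python
import numpy as np

def matrix_sqrt(a):
    """Square root via eigendecomposition; assumes `a` is
    diagonalizable with non-negative real eigenvalues."""
    vals, vecs = np.linalg.eig(a)
    return (vecs * np.sqrt(np.maximum(vals.real, 0.0))) @ np.linalg.inv(vecs)

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^{1/2})."""
    diff = mu1 - mu2
    covmean = matrix_sqrt(sigma1 @ sigma2)
    return float((diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)).real)

# Identical distributions give 0; shifting the mean by 1 in each of
# two dimensions contributes ||diff||^2 = 2.
mu, sigma = np.zeros(2), np.eye(2)
print(fid(mu, sigma, mu + 1.0, sigma))  # 2.0
```

In practice `mu` and `sigma` are the mean and covariance of 2048-dimensional Inception-v3 features over 50K samples, which is what the reported FID-50K numbers use.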
## Citation

```bibtex
@inproceedings{hatamizadeh2025diffit,
  title={{DiffiT}: Diffusion vision transformers for image generation},
  author={Hatamizadeh, Ali and Song, Jiaming and Liu, Guilin and Kautz, Jan and Vahdat, Arash},
  booktitle={European Conference on Computer Vision},
  pages={37--55},
  year={2025},
  organization={Springer}
}
```

## License

Copyright © 2026, NVIDIA Corporation. All rights reserved.

The code is released under the [NVIDIA Source Code License-NC](https://github.com/NVlabs/DiffiT/blob/main/LICENSE). The pretrained models are shared under [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.