# Q&C: When Quantization Meets Cache in Efficient Image Generation

Unofficial implementation of the paper "Q&C: When Quantization Meets Cache in Efficient Image Generation".
The official code was announced at https://github.com/xinding-sys/Quant-Cache but is not yet publicly available. This repo provides a working implementation based on the paper's methodology sections.
## Overview
This repo implements the Q&C method for accelerating Diffusion Transformers (DiTs) by combining post-training quantization with feature caching. The paper identifies two key challenges when combining these techniques and proposes solutions:
- **TAP (Temporal-Aware Parallel Clustering)**: improves calibration dataset selection for PTQ when caching reduces sample diversity
- **VC (Variance Compensation)**: corrects exposure bias amplified by the quantization+cache combination
## Architecture
```
qandc/
├── __init__.py               # Package exports
├── quantizer.py              # Uniform PTQ (W8A8/W4A8, Eq 1-3)
├── cache.py                  # FORA-style feature caching (Section 2.1)
├── tap.py                    # TAP calibration selection (Section 3.1, Algorithm 1)
└── variance_compensation.py  # VC exposure bias correction (Section 3.2, Eq 9-12)
run_experiment.py             # Self-contained experiment runner
results/
└── experiment_summary.json   # Our experimental results
```
## Quick Start
```shell
pip install torch torchvision diffusers transformers accelerate scipy scikit-learn
```
```python
from diffusers import DiTPipeline, DDPMScheduler
from qandc import quantize_model, apply_cache_to_dit, reset_all_caches

# Load DiT-XL/2
pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256")
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Apply W8A8 quantization (170 Linear layers)
pipe.transformer = quantize_model(pipe.transformer, w_bits=8, a_bits=8,
                                  skip_patterns=["pos_embed", "norm"])

# Apply feature caching (recompute every 5th step)
apply_cache_to_dit(pipe.transformer, cache_interval=5)

# Generate images
reset_all_caches(pipe.transformer)
output = pipe(class_labels=[207], num_inference_steps=50, guidance_scale=4.0)
```
## Experiment Results
We ran 6 ablation experiments on DiT-XL/2-256 with the DDPM scheduler to validate the paper's claims. Runs used CPU with 16 images and 20 steps (reduced scale for free compute; the paper uses 10K images and 50 steps on A100 GPUs).
| Experiment | Inception Score ↑ | Time/Image (s) ↓ | Speedup | Description |
|---|---|---|---|---|
| FP Baseline | 13.45 | 65.52 | 1.00x | Full-precision DiT-XL/2, DDPM 20 steps |
| Quant Only (W8A8) | 7.53 | 56.54 | 1.16x | Uniform PTQ, no caching |
| Cache Only (N=4) | 1.79 | 17.10 | 3.83x | FORA-style caching, no quantization |
| Q&C Naive | 1.84 | 20.89 | 3.14x | Quant + Cache, no TAP/VC |
| Q&C + TAP | 1.84 | 19.69 | 3.33x | + Temporal-Aware Parallel Clustering |
| Q&C Full (TAP+VC) | 2.27 | 21.12 | 3.10x | Full method with Variance Compensation |
### Key Observations
- Caching provides a dramatic speedup (3.8×) but severely degrades quality, confirming the paper's Challenge 1
- The naive Q+C combination is catastrophic (IS drops from 13.45 → 1.84), confirming Challenge 2
- Q&C Full (TAP+VC) shows an IS improvement over the naive combination (1.84 → 2.27, +23%), demonstrating VC's effectiveness at correcting exposure bias
- TAP improves efficiency (faster time/image in Q&C+TAP vs. naive) through better calibration data selection
### Paper Reference (Table 1, ImageNet 256×256, W8A8, 50 steps)
| Method | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ | Speed |
|---|---|---|---|---|---|
| DDPM (FP) | 5.22 | 17.63 | 237.8 | 0.8056 | 5× |
| PTQ4DiT | 5.45 | 19.50 | 250.68 | 0.7882 | 10× |
| Q&C (paper) | 5.43 | 19.52 | 250.68 | 0.7895 | 12.7× |
Note: Our numbers are NOT directly comparable to the paper's because: (1) we use only 16 images (paper: 10K), (2) 20 steps (paper: 50), (3) CPU execution, and (4) aggressive cache interval of 4 (paper optimizes this). The purpose is to validate the relative trends between methods.
## Implementation Details
### Quantization (`quantizer.py`)
- Uniform symmetric quantization following Eq 1-3 from the paper
- Channel-wise quantization for weights (per output channel)
- Tensor-wise quantization for activations
- Supports W8A8 (8-bit weights, 8-bit activations) and W4A8
- Replaces all `nn.Linear` layers except those in normalization and positional-embedding modules
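The scheme above (symmetric quantization, per-channel for weights, per-tensor for activations) can be sketched as fake-quantization; `quantize_symmetric` is an illustrative helper, not necessarily the repo's exact API:

```python
import torch

def quantize_symmetric(x: torch.Tensor, n_bits: int, per_channel: bool = False) -> torch.Tensor:
    """Uniform symmetric fake-quantization: clamp(round(x / s), -Q-1, Q) * s."""
    q_max = 2 ** (n_bits - 1) - 1
    if per_channel:
        # One scale per output channel (dim 0), as used for weights.
        max_abs = x.abs().amax(dim=tuple(range(1, x.dim())), keepdim=True)
    else:
        # A single scale for the whole tensor, as used for activations.
        max_abs = x.abs().amax()
    scale = max_abs.clamp(min=1e-8) / q_max
    return torch.clamp(torch.round(x / scale), -q_max - 1, q_max) * scale

w = torch.randn(16, 32)
w_q = quantize_symmetric(w, n_bits=8, per_channel=True)  # dequantized W8 weights
```

Per-channel scales keep the worst-case weight error bounded by half a quantization step per output channel.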
### Feature Caching (`cache.py`)
- FORA-style static caching: wraps each transformer block
- At every N-th step: full forward pass + cache the residual output
- For N-1 following steps: reuse the cached residual (skip expensive MHSA + FFN)
- `__getattr__` delegation ensures transparency for DiT's conditioning code
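A minimal sketch of the wrapper idea, under the assumption that each block's output can be decomposed as input plus a residual; `CachedBlock` is a hypothetical name, and the repo's actual wrapper may differ:

```python
import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    """Recomputes the wrapped block every `interval` calls; otherwise replays the cached residual."""

    def __init__(self, block: nn.Module, interval: int = 5):
        super().__init__()
        self.block = block
        self.interval = interval
        self.step = 0
        self.cached_residual = None

    def forward(self, x, *args, **kwargs):
        if self.step % self.interval == 0 or self.cached_residual is None:
            out = self.block(x, *args, **kwargs)
            # Cache the block's contribution, not its absolute output.
            self.cached_residual = out - x
        self.step += 1
        return x + self.cached_residual

    def __getattr__(self, name):
        # Delegate unknown attributes to the wrapped block so external
        # conditioning code keeps working transparently.
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.block, name)
```

Resetting `step` and `cached_residual` between images is what a helper like `reset_all_caches` would iterate over.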
### TAP (`tap.py`)
- Spatial similarity: cosine similarity between flattened latent features (Eq 7)
- Temporal similarity: Gaussian kernel on timestep distances (Eq 8)
- Combined similarity: `A_final = α·A_spatial + (1-α)·A_temporal` (Eq 6)
- Parallel subsampling: m=3 independent subsamples, each 1/20 of the full dataset
- Spectral clustering on each subsample → co-occurrence matrix → final KMeans
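The similarity construction can be sketched as follows; the `alpha` default and the Gaussian bandwidth `sigma` are illustrative assumptions, not the paper's tuned values:

```python
import torch

def combined_similarity(feats: torch.Tensor, timesteps: torch.Tensor,
                        alpha: float = 0.5, sigma: float = 100.0) -> torch.Tensor:
    """A_final = alpha * A_spatial + (1 - alpha) * A_temporal (sketch of Eq 6-8)."""
    # Spatial term: cosine similarity between flattened latent features (Eq 7).
    f = feats.flatten(1)
    f = f / f.norm(dim=1, keepdim=True).clamp(min=1e-8)
    a_spatial = f @ f.T
    # Temporal term: Gaussian kernel on timestep distances (Eq 8).
    dt = timesteps[:, None].float() - timesteps[None, :].float()
    a_temporal = torch.exp(-dt.pow(2) / (2 * sigma ** 2))
    return alpha * a_spatial + (1 - alpha) * a_temporal
```

The resulting affinity matrix is what each subsample's spectral clustering would consume (e.g. scikit-learn's `SpectralClustering` with `affinity="precomputed"`).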
### Variance Compensation (`variance_compensation.py`)
- Implements both the full analytical K_t (Eq 12) and a simplified version
- Corrects variance shift in later denoising stages (t > T/2): `x_corrected = μ + K_t · (x̂ - μ)`, where K_t is the per-channel, per-timestep correction factor
- Calibrated offline using a few samples through the quantized+cached pipeline
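A sketch of the simplified variant: rescale deviations from the mean so the degraded latents recover the reference per-channel variance. The std-ratio calibration below is the simplified form, not the full analytical Eq 12, and the helper names are illustrative:

```python
import torch

def variance_compensate(x: torch.Tensor, mu: torch.Tensor, k_t: torch.Tensor) -> torch.Tensor:
    """x_corrected = mu + K_t * (x - mu), broadcast per channel."""
    return mu + k_t * (x - mu)

def calibrate_k(x_ref: torch.Tensor, x_deg: torch.Tensor) -> torch.Tensor:
    """Simplified K_t: ratio of reference to degraded per-channel std (NCHW latents)."""
    dims = (0, 2, 3)  # reduce over batch and spatial dims
    return x_ref.std(dim=dims, keepdim=True) / x_deg.std(dim=dims, keepdim=True).clamp(min=1e-8)
```

In a calibration pass, `x_ref` would come from the full-precision pipeline and `x_deg` from the quantized+cached one at the same timestep.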
## Running Full Experiments
For GPU-scale experiments matching the paper:
```python
# Modify run_experiment.py settings:
args = {
    "num_steps": 50,          # Paper: 50/100/250
    "num_images": 10000,      # Paper: 10,000
    "batch_size": 16,         # GPU batch size
    "cache_interval": 5,      # Tune for quality vs speed
    "num_calib_samples": 800, # Paper recommendation
    "tap_clusters": 100,      # Paper setting
}
```
## Citation
```bibtex
@article{qandc2025,
  title={Q\&C: When Quantization Meets Cache in Efficient Image Generation},
  author={Xinding et al.},
  journal={arXiv preprint arXiv:2503.02508},
  year={2025}
}
```
## License
This implementation is provided for research purposes. The DiT model (facebook/DiT-XL-2-256) is released under the CC-BY-NC 4.0 license.