# Q&C: When Quantization Meets Cache in Efficient Image Generation

Unofficial implementation of the paper "Q&C: When Quantization Meets Cache in Efficient Image Generation".
The official code was announced at https://github.com/xinding-sys/Quant-Cache but is not yet publicly available. This repo provides a working implementation based on the paper's methodology sections.
## Overview
This repo implements the Q&C method for accelerating Diffusion Transformers (DiTs) by combining post-training quantization with feature caching. The paper identifies two key challenges when combining these techniques and proposes solutions:
- **TAP (Temporal-Aware Parallel Clustering)**: improves calibration dataset selection for PTQ when caching reduces sample diversity
- **VC (Variance Compensation)**: corrects exposure bias amplified by the quantization+cache combination
## Architecture
```
qandc/
├── __init__.py               # Package exports
├── quantizer.py              # Uniform PTQ (W8A8/W4A8, Eq 1-3)
├── cache.py                  # FORA-style feature caching (Section 2.1)
├── tap.py                    # TAP calibration selection (Section 3.1, Algorithm 1)
└── variance_compensation.py  # VC exposure bias correction (Section 3.2, Eq 9-12)
run_experiment.py             # Self-contained experiment runner
results/
└── experiment_summary.json   # Our experimental results
```
## Quick Start
```shell
pip install torch torchvision diffusers transformers accelerate scipy scikit-learn
```
```python
from diffusers import DiTPipeline, DDPMScheduler
from qandc import quantize_model, apply_cache_to_dit, reset_all_caches

# Load DiT-XL/2
pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256")
pipe.scheduler = DDPMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Apply W8A8 quantization (170 Linear layers)
pipe.transformer = quantize_model(pipe.transformer, w_bits=8, a_bits=8,
                                  skip_patterns=["pos_embed", "norm"])

# Apply feature caching (recompute every 5th step)
apply_cache_to_dit(pipe.transformer, cache_interval=5)

# Generate images
reset_all_caches(pipe.transformer)
output = pipe(class_labels=[207], num_inference_steps=50, guidance_scale=4.0)
```
## Experiment Results
We ran 6 ablation experiments on DiT-XL/2-256 with the DDPM scheduler to validate the paper's claims. Runs used CPU with 16 images and 20 steps (reduced scale for free compute; the paper uses 10K images and 50 steps on A100 GPUs).
| Experiment | Inception Score ↑ | Time/Image (s) ↓ | Speedup | Description |
|---|---|---|---|---|
| FP Baseline | 13.45 | 65.52 | 1.00x | Full-precision DiT-XL/2, DDPM 20 steps |
| Quant Only (W8A8) | 7.53 | 56.54 | 1.16x | Uniform PTQ, no caching |
| Cache Only (N=4) | 1.79 | 17.10 | 3.83x | FORA-style caching, no quantization |
| Q&C Naive | 1.84 | 20.89 | 3.14x | Quant + Cache, no TAP/VC |
| Q&C + TAP | 1.84 | 19.69 | 3.33x | + Temporal-Aware Parallel Clustering |
| Q&C Full (TAP+VC) | 2.27 | 21.12 | 3.10x | Full method with Variance Compensation |
### Key Observations
- Caching provides a dramatic speedup (3.8×) but severely degrades quality, confirming the paper's Challenge 1
- The naive Q+C combination is catastrophic (IS drops from 13.45 → 1.84), confirming Challenge 2
- Q&C Full (TAP+VC) shows an IS improvement over the naive combination (1.84 → 2.27, +23%), demonstrating VC's effectiveness at correcting exposure bias
- TAP improves efficiency (faster time/image in Q&C+TAP vs. naive) through better calibration data selection
### Paper Reference (Table 1, ImageNet 256×256, W8A8, 50 steps)
| Method | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ | Speed |
|---|---|---|---|---|---|
| DDPM (FP) | 5.22 | 17.63 | 237.8 | 0.8056 | 5× |
| PTQ4DiT | 5.45 | 19.50 | 250.68 | 0.7882 | 10× |
| Q&C (paper) | 5.43 | 19.52 | 250.68 | 0.7895 | 12.7× |
Note: Our numbers are NOT directly comparable to the paper's because: (1) we use only 16 images (paper: 10K), (2) 20 steps (paper: 50), (3) CPU execution, and (4) aggressive cache interval of 4 (paper optimizes this). The purpose is to validate the relative trends between methods.
## Implementation Details
### Quantization (`quantizer.py`)
- Uniform symmetric quantization following Eq 1-3 from the paper
- Channel-wise quantization for weights (per output channel)
- Tensor-wise quantization for activations
- Supports W8A8 (8-bit weights, 8-bit activations) and W4A8
- Replaces all `nn.Linear` layers except those in normalization and positional-embedding modules
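The scheme above (symmetric quantization, per-channel for weights, per-tensor for activations) can be sketched as fake-quantization; `quantize_symmetric` is an illustrative helper, not necessarily the repo's exact API:

```python
import torch

def quantize_symmetric(x: torch.Tensor, n_bits: int, per_channel: bool = False) -> torch.Tensor:
    """Uniform symmetric fake-quantization: clamp(round(x / s), -Q-1, Q) * s."""
    q_max = 2 ** (n_bits - 1) - 1
    if per_channel:
        # One scale per output channel (dim 0), as used for weights.
        max_abs = x.abs().amax(dim=tuple(range(1, x.dim())), keepdim=True)
    else:
        # A single scale for the whole tensor, as used for activations.
        max_abs = x.abs().amax()
    scale = max_abs.clamp(min=1e-8) / q_max
    return torch.clamp(torch.round(x / scale), -q_max - 1, q_max) * scale

w = torch.randn(16, 32)
w_q = quantize_symmetric(w, n_bits=8, per_channel=True)  # dequantized W8 weights
```

Per-channel scales keep the worst-case weight error bounded by half a quantization step per output channel.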
### Feature Caching (`cache.py`)
- FORA-style static caching: wraps each transformer block
- At every N-th step: full forward pass + cache the residual output
- For N-1 following steps: reuse the cached residual (skip expensive MHSA + FFN)
- `__getattr__` delegation ensures transparency for DiT's conditioning code
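A minimal sketch of the wrapper idea, under the assumption that each block's output can be decomposed as input plus a residual; `CachedBlock` is a hypothetical name, and the repo's actual wrapper may differ:

```python
import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    """Recomputes the wrapped block every `interval` calls; otherwise replays the cached residual."""

    def __init__(self, block: nn.Module, interval: int = 5):
        super().__init__()
        self.block = block
        self.interval = interval
        self.step = 0
        self.cached_residual = None

    def forward(self, x, *args, **kwargs):
        if self.step % self.interval == 0 or self.cached_residual is None:
            out = self.block(x, *args, **kwargs)
            # Cache the block's contribution, not its absolute output.
            self.cached_residual = out - x
        self.step += 1
        return x + self.cached_residual

    def __getattr__(self, name):
        # Delegate unknown attributes to the wrapped block so external
        # conditioning code keeps working transparently.
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.block, name)
```

Resetting `step` and `cached_residual` between images is what a helper like `reset_all_caches` would iterate over.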
### TAP (`tap.py`)
- Spatial similarity: cosine similarity between flattened latent features (Eq 7)
- Temporal similarity: Gaussian kernel on timestep distances (Eq 8)
- Combined similarity: `A_final = α·A_spatial + (1-α)·A_temporal` (Eq 6)
- Parallel subsampling: m=3 independent subsamples, each 1/20 of the full dataset
- Spectral clustering on each subsample → co-occurrence matrix → final KMeans
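The similarity construction can be sketched as follows; the `alpha` default and the Gaussian bandwidth `sigma` are illustrative assumptions, not the paper's tuned values:

```python
import torch

def combined_similarity(feats: torch.Tensor, timesteps: torch.Tensor,
                        alpha: float = 0.5, sigma: float = 100.0) -> torch.Tensor:
    """A_final = alpha * A_spatial + (1 - alpha) * A_temporal (sketch of Eq 6-8)."""
    # Spatial term: cosine similarity between flattened latent features (Eq 7).
    f = feats.flatten(1)
    f = f / f.norm(dim=1, keepdim=True).clamp(min=1e-8)
    a_spatial = f @ f.T
    # Temporal term: Gaussian kernel on timestep distances (Eq 8).
    dt = timesteps[:, None].float() - timesteps[None, :].float()
    a_temporal = torch.exp(-dt.pow(2) / (2 * sigma ** 2))
    return alpha * a_spatial + (1 - alpha) * a_temporal
```

The resulting affinity matrix is what each subsample's spectral clustering would consume (e.g. scikit-learn's `SpectralClustering` with `affinity="precomputed"`).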
### Variance Compensation (`variance_compensation.py`)
- Implements both the full analytical K_t (Eq 12) and a simplified version
- Corrects variance shift in later denoising stages (t > T/2): `x_corrected = μ + K_t · (x̂ - μ)`, where K_t is the per-channel, per-timestep correction factor
- Calibrated offline using a few samples through the quantized+cached pipeline
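A sketch of the simplified variant: rescale deviations from the mean so the degraded latents recover the reference per-channel variance. The std-ratio calibration below is the simplified form, not the full analytical Eq 12, and the helper names are illustrative:

```python
import torch

def variance_compensate(x: torch.Tensor, mu: torch.Tensor, k_t: torch.Tensor) -> torch.Tensor:
    """x_corrected = mu + K_t * (x - mu), broadcast per channel."""
    return mu + k_t * (x - mu)

def calibrate_k(x_ref: torch.Tensor, x_deg: torch.Tensor) -> torch.Tensor:
    """Simplified K_t: ratio of reference to degraded per-channel std (NCHW latents)."""
    dims = (0, 2, 3)  # reduce over batch and spatial dims
    return x_ref.std(dim=dims, keepdim=True) / x_deg.std(dim=dims, keepdim=True).clamp(min=1e-8)
```

In a calibration pass, `x_ref` would come from the full-precision pipeline and `x_deg` from the quantized+cached one at the same timestep.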
## Running Full Experiments
For GPU-scale experiments matching the paper:
```python
# Modify run_experiment.py settings:
args = {
    "num_steps": 50,          # Paper: 50/100/250
    "num_images": 10000,      # Paper: 10,000
    "batch_size": 16,         # GPU batch size
    "cache_interval": 5,      # Tune for quality vs speed
    "num_calib_samples": 800, # Paper recommendation
    "tap_clusters": 100,      # Paper setting
}
```
## Citation
```bibtex
@article{qandc2025,
  title={Q\&C: When Quantization Meets Cache in Efficient Image Generation},
  author={Xinding et al.},
  journal={arXiv preprint arXiv:2503.02508},
  year={2025}
}
```
## License
This implementation is provided for research purposes. The DiT model (facebook/DiT-XL-2-256) is released under the CC-BY-NC 4.0 license.