azure-scripts / turboquant_case_study.md

azure home scripts: data gen, training, misc

a70eb3d verified 10 days ago

3.32 kB

	# TurboQuant: 44-59% KV-Cache Reduction With Zero Quality Loss

	## The Problem

	LLM inference is memory-bound. As context length grows, the KV-cache eats your VRAM alive. At 8K tokens, a Gemma-2-9B model burns nearly 4 GB just on KV-cache. That's memory you can't use for batching more requests, which means fewer concurrent users per GPU, which means higher cost per query.

	Every production team running LLMs hits this wall.

	## What TurboQuant Does

	TurboQuant applies mixed-precision quantization to the KV-cache during inference. It profiles each layer's activation norms, identifies outlier layers that need full precision, and quantizes the rest — cutting KV-cache memory by 44-59% while maintaining exact prefill fidelity.

	No retraining. No fine-tuning. Drop-in replacement.

	## Results

	Benchmarked across 5 model families on NVIDIA H100 NVL (96 GB):

	### Memory Savings at 8K Context

	\| Model \| Default VRAM \| TurboQuant VRAM \| KV-Cache Saved \| Prefill Fidelity \|
	\|-------\|-------------\|----------------\|----------------\|-----------------\|
	\| Gemma-2-9B \| 9.98 GB \| 7.71 GB \| 2,323 MB (~59%) \| Exact \|
	\| Qwen2.5-32B \| 23.16 GB \| 21.41 GB \| 1,791 MB (~47%) \| Exact \|
	\| Phi-4-14B \| 12.28 GB \| 10.92 GB \| 1,392 MB (~44%) \| Exact \|
	\| LLaMA-3.1-8B \| 7.71 GB \| 6.84 GB \| 890 MB (~44%) \| Exact \|
	\| Qwen2.5-7B \| 7.08 GB \| 6.71 GB \| 380 MB (~44%) \| Exact \|

	### Quality Verification

	- Prefill logit difference: 0.0 across all models — the quantized KV-cache produces identical logits at the prefill stage
	- Same top-1 token prediction: 100% — no drift in the most likely next token
	- Output coherence: 100% — both default and TurboQuant outputs are fully coherent across all test prompts
	- Token match rate: 18-100% on generation (expected — autoregressive sampling diverges naturally, but both outputs remain equally valid)

	### Scaling With Context Length

	Memory savings grow linearly with context. LLaMA-3.1-8B example:

	\| Context Length \| Saved \|
	\|---------------\|-------\|
	\| 1K tokens \| 93 MB \|
	\| 4K tokens \| 417 MB \|
	\| 8K tokens \| 890 MB \|

	At 32K or 128K context (LLaMA-3.1 supports 128K), the savings become massive — potentially 3-14 GB on a single model.

	### Outlier-Aware Design

	Not all layers are equal. TurboQuant detects outlier layers with abnormal activation norms and keeps them at full precision:

	- Qwen2.5-7B: layers 0 and 27 flagged as outliers (norms 273.84 and 239.91 vs median 16.86) — kept at BF16
	- All other models: uniform norm distributions, all layers quantized

	This is why quality stays intact — the layers that matter most keep their precision.

	## What This Means For Production

	If you're running LLMs in production:

	- 2-3x more concurrent users on the same GPU (freed VRAM = larger batch sizes)
	- Same quality — your users won't notice any difference
	- No model changes — works with any transformer architecture using standard KV-cache
	- Tested across: Qwen2, LLaMA, Gemma2, Phi3 architectures

	## About

	Built by Vivek Varikuti. I optimize LLM inference for production workloads.

	If your GPU bill is too high or your throughput is too low, I can help. Free 1-week proof-of-concept on your setup — you pay nothing if it doesn't beat your current numbers.

	Reach me: domainluther1234@gmail.com \| GitHub: vivekvar-dl

	# TurboQuant: 44-59% KV-Cache Reduction With Zero Quality Loss

	## The Problem

	LLM inference is memory-bound. As context length grows, the KV-cache eats your VRAM alive. At 8K tokens, a Gemma-2-9B model burns nearly 4 GB just on KV-cache. That's memory you can't use for batching more requests, which means fewer concurrent users per GPU, which means higher cost per query.

	Every production team running LLMs hits this wall.

	## What TurboQuant Does

	TurboQuant applies mixed-precision quantization to the KV-cache during inference. It profiles each layer's activation norms, identifies outlier layers that need full precision, and quantizes the rest — cutting KV-cache memory by 44-59% while maintaining exact prefill fidelity.

	No retraining. No fine-tuning. Drop-in replacement.

	## Results

	Benchmarked across 5 model families on NVIDIA H100 NVL (96 GB):

	### Memory Savings at 8K Context

	\| Model \| Default VRAM \| TurboQuant VRAM \| KV-Cache Saved \| Prefill Fidelity \|
	\|-------\|-------------\|----------------\|----------------\|-----------------\|
	\| Gemma-2-9B \| 9.98 GB \| 7.71 GB \| 2,323 MB (~59%) \| Exact \|
	\| Qwen2.5-32B \| 23.16 GB \| 21.41 GB \| 1,791 MB (~47%) \| Exact \|
	\| Phi-4-14B \| 12.28 GB \| 10.92 GB \| 1,392 MB (~44%) \| Exact \|
	\| LLaMA-3.1-8B \| 7.71 GB \| 6.84 GB \| 890 MB (~44%) \| Exact \|
	\| Qwen2.5-7B \| 7.08 GB \| 6.71 GB \| 380 MB (~44%) \| Exact \|

	### Quality Verification

	- Prefill logit difference: 0.0 across all models — the quantized KV-cache produces identical logits at the prefill stage
	- Same top-1 token prediction: 100% — no drift in the most likely next token
	- Output coherence: 100% — both default and TurboQuant outputs are fully coherent across all test prompts
	- Token match rate: 18-100% on generation (expected — autoregressive sampling diverges naturally, but both outputs remain equally valid)

	### Scaling With Context Length

	Memory savings grow linearly with context. LLaMA-3.1-8B example:

	\| Context Length \| Saved \|
	\|---------------\|-------\|
	\| 1K tokens \| 93 MB \|
	\| 4K tokens \| 417 MB \|
	\| 8K tokens \| 890 MB \|

	At 32K or 128K context (LLaMA-3.1 supports 128K), the savings become massive — potentially 3-14 GB on a single model.

	### Outlier-Aware Design

	Not all layers are equal. TurboQuant detects outlier layers with abnormal activation norms and keeps them at full precision:

	- Qwen2.5-7B: layers 0 and 27 flagged as outliers (norms 273.84 and 239.91 vs median 16.86) — kept at BF16
	- All other models: uniform norm distributions, all layers quantized

	This is why quality stays intact — the layers that matter most keep their precision.

	## What This Means For Production

	If you're running LLMs in production:

	- 2-3x more concurrent users on the same GPU (freed VRAM = larger batch sizes)
	- Same quality — your users won't notice any difference
	- No model changes — works with any transformer architecture using standard KV-cache
	- Tested across: Qwen2, LLaMA, Gemma2, Phi3 architectures

	## About

	Built by Vivek Varikuti. I optimize LLM inference for production workloads.

	If your GPU bill is too high or your throughput is too low, I can help. Free 1-week proof-of-concept on your setup — you pay nothing if it doesn't beat your current numbers.

	Reach me: domainluther1234@gmail.com \| GitHub: vivekvar-dl