AttnVQ — Attention-Aware KV Cache Quantization

AttnVQ is a training-free product vector quantizer for the KV cache of long-context LLMs. It fits small per-subspace codebooks with attention-weighted batched LBG (Linde-Buzo-Gray), where centroid updates are weighted by the attention mass each key receives under GQA causal attention. Distortion is scored by attention-output error, together with key cosine similarity and inner-product bias, rather than by cache MSE. Calibration is light: 10–15 agent traces and about 15 s on a GPU are enough to capture the model's K/V geometry, so the codebooks are data-aware but not corpus-dependent.
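
The codebook fit described above can be illustrated with a minimal NumPy sketch. It is not the repo's implementation: plain weighted Lloyd iterations stand in for the batched LBG procedure, the per-key weights are random placeholders for real attention mass, and every function name here (`fit_weighted_codebook`, `pq_fit`, `pq_encode`, `pq_decode`) is hypothetical.

```python
import numpy as np

def fit_weighted_codebook(x, w, k, iters=10, seed=0):
    """Weighted Lloyd iterations on one subspace.

    x: (n, d) sub-vectors; w: (n,) non-negative weights (e.g. attention mass).
    Returns a (k, d) codebook.
    """
    rng = np.random.default_rng(seed)
    codebook = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each sub-vector to its nearest centroid (squared L2).
        d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for c in range(k):
            mask = assign == c
            mass = w[mask].sum()
            if mass > 0:
                # Weighted mean: heavily attended keys pull the centroid harder.
                codebook[c] = (w[mask, None] * x[mask]).sum(0) / mass
    return codebook

def pq_fit(x, w, n_sub, k):
    """Split the feature axis into n_sub equal subspaces, one codebook each."""
    return [fit_weighted_codebook(s, w, k) for s in np.split(x, n_sub, axis=1)]

def pq_encode(x, codebooks):
    """Return (n, n_sub) uint8 indices; requires at most 256 centroids."""
    subs = np.split(x, len(codebooks), axis=1)
    codes = [((s[:, None, :] - cb[None]) ** 2).sum(-1).argmin(1)
             for s, cb in zip(subs, codebooks)]
    return np.stack(codes, axis=1).astype(np.uint8)

def pq_decode(codes, codebooks):
    """Look up each index and concatenate the subspace reconstructions."""
    return np.concatenate(
        [cb[codes[:, i]] for i, cb in enumerate(codebooks)], axis=1)

# Toy data: 512 "keys" of dimension 16, with placeholder attention weights.
rng = np.random.default_rng(0)
keys = rng.standard_normal((512, 16)).astype(np.float32)
attn_mass = rng.random(512).astype(np.float32)

codebooks = pq_fit(keys, attn_mass, n_sub=4, k=16)
codes = pq_encode(keys, codebooks)   # stored in the cache
recon = pq_decode(codes, codebooks)  # rebuilt when attention needs the keys
```

Only `codes` and the small codebooks need to be kept, which is where the memory saving comes from. The weighting changes where centroids land but not the storage format.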

Primary target: Laguna-XS.2, though the method is model-agnostic. Only the 10 full-attention layers are compressed; the 30 sliding-window layers stay in fp16.

What this repo offers

Component	Description
`generate.py`	Minimal inference: `VQQuantizedCache` → `model.generate()`
`vqkv/`	Quantizers (ProductVQ, RoPESplit, scalar/KIVI baselines), attention-aware metrics, compressed cache
`benchmark.py`	Fit codebooks + cheap metrics (key cosine, attn-output error, ip-bias) on real cache dumps
`turbo_benchmark.py`	Faithful TurboQuant baseline (Haar rotation + Lloyd-Max + QJL)
`longbench_eval.py`	LongBench v1 proxy metrics + optional end-to-end task scoring
`app.py`	Gradio demo for live generation
`attnvq_slides.html`	Slides, including LongBench metrics
`artifacts/`	Pre-fit codebooks and LongBench results

Variants: productvq-* (AttnVQ), ropesplit-1b (RoPE-half split for Laguna), scalar/KIVI/sign/ternary baselines, TurboQuant MSE/Prod.
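
As a rough illustration of the cheap metrics named above (key cosine, attention-output error, ip-bias), the sketch below computes one plausible version of each for a single attention head. The definitions are assumptions made for illustration, not the formulas used in `vqkv/` or `benchmark.py`, and `causal_attention` and `cheap_metrics` are hypothetical names.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal softmax attention; q, k, v have shape (T, d)."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((t, t), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

def cheap_metrics(q, k, v, k_hat, v_hat):
    """Compare exact (k, v) against dequantized (k_hat, v_hat)."""
    eps = 1e-8
    # Mean cosine similarity between each key and its reconstruction.
    key_cos = ((k * k_hat).sum(-1)
               / (np.linalg.norm(k, axis=-1) * np.linalg.norm(k_hat, axis=-1) + eps)
               ).mean()
    # Relative error of the attention output, the quantity the model consumes.
    out = causal_attention(q, k, v)
    out_hat = causal_attention(q, k_hat, v_hat)
    attn_err = np.linalg.norm(out - out_hat) / (np.linalg.norm(out) + eps)
    # Mean signed error of q.k inner products over causally visible pairs.
    mask = np.tril(np.ones((len(q), len(k)), dtype=bool))
    ip_bias = (q @ k_hat.T - q @ k.T)[mask].mean()
    return {"key_cos": float(key_cos),
            "attn_out_err": float(attn_err),
            "ip_bias": float(ip_bias)}

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 32)) for _ in range(3))
exact = cheap_metrics(q, k, v, k, v)  # perfect reconstruction
noisy = cheap_metrics(q, k, v, k + 0.1 * rng.standard_normal(k.shape), v)
```

Scoring the attention output rather than the raw cache error means a quantizer is penalized only for errors that survive the softmax, which matches the motivation for not using cache MSE.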

Headline results (Laguna-XS.2)

Memory @ 131K context (full-attention layers only):

Config	KV cache
fp16	5.4 GB
AttnVQ 2-bit (`productvq-32x256-2b`)	0.73 GB (7.4×)
AttnVQ 1-bit (`productvq-16x256-1b`)	0.40 GB (14×)
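
The bit budgets in the table can be sanity-checked with a little arithmetic. The sketch below assumes that a variant name of the form `productvq-MxK-Bb` means M subspaces with K centroids each, and that each quantized vector has 128 dimensions; neither is stated in this README, and the helper names are hypothetical.

```python
import math

def pq_bits_per_dim(n_subspaces, n_centroids, vector_dim):
    """Index bits stored per original dimension for a product quantizer."""
    return n_subspaces * math.log2(n_centroids) / vector_dim

def index_only_cache_gb(fp16_gb, bits_per_dim):
    """Cache size if only the indices are stored (fp16 uses 16 bits per dim)."""
    return fp16_gb * bits_per_dim / 16

VECTOR_DIM = 128  # assumption, not taken from the model config

two_bit = pq_bits_per_dim(32, 256, VECTOR_DIM)  # 32 one-byte indices per vector
one_bit = pq_bits_per_dim(16, 256, VECTOR_DIM)  # 16 one-byte indices per vector

est_two_bit_gb = index_only_cache_gb(5.4, two_bit)
est_one_bit_gb = index_only_cache_gb(5.4, one_bit)
```

Under these assumptions the index-only estimates come out at roughly 0.68 GB and 0.34 GB, slightly below the reported 0.73 GB and 0.40 GB. The difference is presumably overhead that this estimate ignores, such as the codebooks themselves.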

LongBench v1 (mean F1 over qasper, 2wikimqa, hotpotqa, repobench-p; single 15-trace codebook):

2-bit: AttnVQ retains ~96% of fp16, versus ~83% for TurboQuant and ~75% for INT2
1-bit: AttnVQ and RoPESplit beat every iso-budget baseline on every task
0.5-bit: only the VQ variants reach this budget at all

Full numbers: artifacts/longbench_results.json, artifacts/longbench_cheap_metrics.json.

Note: Wall-clock speedup requires a fused dequant kernel.

Quick start

Tested on CUDA 12.4 / NVIDIA A100.

pip install "git+https://github.com/huggingface/transformers.git" \
    accelerate datasets torch==2.9.1 torchvision tqdm
python generate.py

Use fitted codebooks for memory-efficient long-context generation:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from vqkv.compressed_cache import VQQuantizedCache


# load model
tok = AutoTokenizer.from_pretrained("poolside/Laguna-XS.2", trust_remote_code=True, fix_mistral_regex=True)
model = AutoModelForCausalLM.from_pretrained("poolside/Laguna-XS.2", torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True).eval()

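# load the pre-fit codebooks (or fit your own with benchmark.py)
# weights_only=False unpickles arbitrary Python objects, so only load files you trust
CODEBOOKS_PATH = "artifacts/codebooks.pt"
codebooks = torch.load(CODEBOOKS_PATH, map_location="cuda", weights_only=False)

# build the cache: one fitted quantizer set, applied to the full-attention layers
quantizers = codebooks["fitted"]["productvq-32x256-2b"]
layers = codebooks["meta"]["full_layers"]
cache = VQQuantizedCache(quantizers, layers)  # persists uint8 codebook indices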

# generate
ids = tok("Hello", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=32, past_key_values=cache, use_cache=True)
print(tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True))

# print memory footprint
print(cache.memory_footprint())

Reproduce

Precomputed results are under artifacts/. To re-fit and evaluate:

python benchmark.py --stage fit
python turbo_benchmark.py --stage fit
python longbench_eval.py --stage cheap --n_eval 50     # cheap metrics
python longbench_eval.py --stage generate --n_eval 50  # slow: full generation & task metrics
