MatGPTQ: Accurate and Efficient Post-Training Matryoshka Quantization
Abstract
Post-Training Matryoshka Quantization (MatGPTQ) enables single-checkpoint, multi-precision model deployment through efficient one-shot quantization with bit-slicing and cross-bit error compensation.
Matryoshka Quantization (MatQuant) is a recent quantization approach showing that a single integer-quantized model can be served at multiple precisions by slicing out the most significant bits (MSBs) at inference time. This lets a single checkpoint cover a wide range of memory and latency budgets, but makes quantization considerably more challenging. In particular, the original MatQuant relies on expensive quantization-aware training (QAT) variants rather than fast one-shot post-training quantization (PTQ), and lacks open-source and kernel support. We address all of these limitations by introducing Post-Training Matryoshka Quantization (MatGPTQ), a new PTQ pipeline that produces a single parent model jointly optimized for multiple target precisions in one shot, based on a small calibration set. MatGPTQ casts Matryoshka quantization as a multi-precision objective with bit-slicing and cross-bit error compensation, resulting in an algorithm that produces a multi-bit-width, "sliceable" model in a single pass. We also incorporate a new budget-aware search for heterogeneous per-layer bit-widths and provide efficient kernels that implement slicing and mixed-precision execution. Across standard LLMs and benchmarks, MatGPTQ preserves high-bit accuracy while substantially improving performance at low bit-widths. Overall, we establish a new state of the art for Matryoshka-style post-training quantization and make single-checkpoint, multi-precision deployment open and practical. Code is available at https://github.com/IST-DASLab/MatGPTQ.
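As a rough illustration of the bit-slicing idea (not the MatGPTQ algorithm itself), the sketch below shows how a single 8-bit integer quantization can be sliced into lower-precision views by keeping only the most significant bits and rescaling; the symmetric per-tensor quantizer, shapes, and rounding used here are illustrative assumptions.

```python
# A minimal sketch (not the authors' implementation) of Matryoshka-style
# MSB bit-slicing: one int8-quantized weight tensor is "sliced" into
# 4-bit and 2-bit views at inference time by keeping only the top bits.
import numpy as np

def quantize_int(w: np.ndarray, bits: int = 8):
    """Symmetric per-tensor integer quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def slice_msb(q: np.ndarray, scale: float, parent_bits: int, child_bits: int):
    """Keep the top `child_bits` MSBs of a `parent_bits` code.

    The child codes are roughly q / 2**(parent_bits - child_bits), so the
    dequantization scale grows by the same factor.
    """
    shift = parent_bits - child_bits
    q_child = np.round(q / 2 ** shift).astype(np.int32)
    qmax = 2 ** (child_bits - 1) - 1
    q_child = np.clip(q_child, -qmax - 1, qmax)
    return q_child, scale * 2 ** shift

# Toy example: one 8-bit parent serves 8-, 4-, and 2-bit children.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q8, s8 = quantize_int(w, bits=8)
for bits in (8, 4, 2):
    q, s = (q8, s8) if bits == 8 else slice_msb(q8, s8, 8, bits)
    err = np.abs(w - q * s).mean()
    print(f"{bits}-bit slice: mean abs error = {err:.4f}")
```

Running this shows the reconstruction error growing as bits are sliced away, which is exactly the low-bit degradation that MatGPTQ's multi-precision objective and cross-bit error compensation are designed to reduce.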
Community
An accurate and efficient post-training quantization method that jointly optimizes multiple bit-widths, producing a single sliceable checkpoint that can be deployed seamlessly across diverse hardware and memory budgets.
On-par performance with native GPTQ, plus custom CUDA kernels and full vLLM support (a minimal serving sketch is shown below).
Code is available on GitHub.
Models are available on HuggingFace.
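If the released checkpoints follow the usual Hugging Face layout, serving one with vLLM should look like a standard model load. The repository id below is a placeholder, and how a target bit-width is selected (e.g., via the checkpoint variant or its quantization config) is an assumption not covered by this sketch.

```python
# Minimal vLLM serving sketch; "IST-DASLab/<matgptq-model>" is a placeholder
# repository id, and bit-width selection is assumed to be encoded in the
# chosen checkpoint rather than shown here.
from vllm import LLM, SamplingParams

llm = LLM(model="IST-DASLab/<matgptq-model>")  # placeholder repo id
params = SamplingParams(max_tokens=64, temperature=0.0)
outputs = llm.generate(["Explain Matryoshka quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```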