Quantization Reading-List
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
• arXiv:2208.07339
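
As background for this entry, a minimal sketch of plain per-row absmax int8 weight quantization, the baseline these methods refine (LLM.int8() additionally keeps outlier feature dimensions in fp16; the code is illustrative, not the paper's implementation):

    import torch

    def absmax_quantize_int8(w: torch.Tensor):
        # Per output row, map the largest magnitude to 127 and round to int8.
        scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
        q = (w / scale).round().clamp(-127, 127).to(torch.int8)
        return q, scale

    def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
        return q.to(torch.float32) * scale

    w = torch.randn(1024, 1024)
    q, scale = absmax_quantize_int8(w)
    print((w - dequantize_int8(q, scale)).abs().max())  # small reconstruction error
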
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
• arXiv:2210.17323
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
• arXiv:2211.10438
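
The core idea is to migrate activation outliers into the weights with a per-input-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha) so the layer output is unchanged; a rough sketch under assumed tensor shapes:

    import torch

    def smoothquant_scales(act_absmax: torch.Tensor, w: torch.Tensor, alpha: float = 0.5):
        # act_absmax: per-input-channel max |activation| from calibration data.
        # w: weight of shape [in_features, out_features] used as y = x @ w.
        w_absmax = w.abs().amax(dim=1)
        s = act_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)
        return s.clamp(min=1e-5)

    x = torch.randn(8, 512) * 10            # pretend activations with large outliers
    w = torch.randn(512, 256)
    s = smoothquant_scales(x.abs().amax(dim=0), w)
    # Folding the scale leaves the output unchanged, but x / s is easier to quantize.
    print(((x / s) @ (w * s[:, None]) - x @ w).abs().max())
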
QLoRA: Efficient Finetuning of Quantized LLMs
• arXiv:2305.14314
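
In practice this setup is commonly reproduced with 4-bit NF4 loading in bitsandbytes plus LoRA adapters from peft; a sketch along those lines (checkpoint name and hyperparameters are placeholders, and behavior depends on library versions):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",           # 4-bit NormalFloat from the paper
        bnb_4bit_use_double_quant=True,      # double quantization of the scales
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",          # placeholder checkpoint
        quantization_config=bnb_config,
        device_map="auto",
    )
    lora_config = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"], # placeholder module names
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()       # only the LoRA weights train
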
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
• arXiv:2306.00978
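
A loose sketch of the activation-aware idea: search a per-input-channel scale derived from activation magnitudes so that quantizing the scaled weight hurts the layer output least (this simplifies AWQ's actual search and calibration):

    import torch

    def rtn_quant_dequant(w: torch.Tensor, n_bits: int = 4):
        # Symmetric round-to-nearest, per output channel (nn.Linear [out, in] layout).
        qmax = 2 ** (n_bits - 1) - 1
        scale = (w.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
        return (w / scale).round().clamp(-qmax, qmax) * scale

    def awq_style_scale_search(w: torch.Tensor, x: torch.Tensor, n_bits: int = 4, grid: int = 20):
        # Try s = act_absmax**alpha for several alpha and keep the one whose
        # quantized layer output is closest to the full-precision output.
        act_absmax = x.abs().amax(dim=0)
        y_ref = x @ w.t()
        best_err, best_s = float("inf"), torch.ones_like(act_absmax)
        for alpha in torch.linspace(0.0, 1.0, grid):
            s = act_absmax.pow(alpha).clamp(min=1e-4)
            w_q = rtn_quant_dequant(w * s, n_bits) / s   # scale, quantize, undo the scale
            err = (x @ w_q.t() - y_ref).pow(2).mean().item()
            if err < best_err:
                best_err, best_s = err, s
        return best_s

    w = torch.randn(256, 512)                            # [out_features, in_features]
    x = torch.randn(64, 512) * (torch.rand(512) * 10)    # per-channel activation outliers
    s = awq_style_scale_search(w, x)
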
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
• arXiv:2309.14717
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
• arXiv:2310.08659
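
The key idea is to initialize the LoRA factors from the quantization residual rather than from zero; a single-step sketch (the paper alternates quantization and SVD over several iterations, and quantize_dequant below stands in for any weight quantizer):

    import torch

    def loftq_style_init(w: torch.Tensor, quantize_dequant, rank: int = 16):
        # Quantize the weight, then fit a rank-r correction to what was lost.
        q_w = quantize_dequant(w)                       # dequantized quantized weight
        U, S, Vh = torch.linalg.svd(w - q_w, full_matrices=False)
        A = U[:, :rank] * S[:rank]                      # [out, r]
        B = Vh[:rank, :]                                # [r, in]
        return q_w, A, B                                # w ≈ q_w + A @ B

    w = torch.randn(512, 512)
    int4 = lambda t: (t / t.abs().max() * 7).round().clamp(-7, 7) / 7 * t.abs().max()
    q_w, A, B = loftq_style_init(w, int4, rank=16)
    print((w - q_w).norm(), (w - (q_w + A @ B)).norm())  # the adapter shrinks the residual
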
BitNet: Scaling 1-bit Transformers for Large Language Models
• arXiv:2310.11453
FP8-LM: Training FP8 Large Language Models
• arXiv:2310.18313
GPTVQ: The Blessing of Dimensionality for LLM Quantization
• arXiv:2402.15319
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
• arXiv:2402.17764
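
The b1.58 recipe constrains weights to {-1, 0, +1} using an absmean scale; a minimal sketch of that weight quantizer (in the paper it is applied inside BitLinear layers during training):

    import torch

    def ternary_absmean_quantize(w: torch.Tensor, eps: float = 1e-5):
        # Scale by the mean |w|, then round and clip to {-1, 0, +1}.
        gamma = w.abs().mean()
        wq = (w / (gamma + eps)).round().clamp(-1, 1)
        return wq, gamma                                # effective weight ≈ wq * gamma

    w = torch.randn(1024, 1024)
    wq, gamma = ternary_absmean_quantize(w)
    print(wq.unique())                                  # tensor([-1., 0., 1.])
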
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
• arXiv:2412.14590
• arXiv:2502.06786
BitNet b1.58 2B4T Technical Report
• arXiv:2504.12285
BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs
• arXiv:2504.18415
TernaryLLM: Ternarized Large Language Model
• arXiv:2406.07177
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
• arXiv:2301.00774
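
For contrast: SparseGPT prunes with second-order (Hessian-based) error compensation, whereas the naive baseline it is usually compared against is plain magnitude pruning, sketched below (not the paper's method):

    import torch

    def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5):
        # Zero out the smallest-magnitude fraction of weights.
        k = int(w.numel() * sparsity)
        if k == 0:
            return w.clone()
        threshold = w.abs().flatten().kthvalue(k).values
        return w * (w.abs() > threshold)

    w = torch.randn(2048, 2048)
    pruned = magnitude_prune(w, sparsity=0.5)
    print((pruned == 0).float().mean())                 # ≈ 0.5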