Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
Abstract
Speculative sampling methods are enhanced by formulating them as constrained optimization problems, enabling controlled distribution divergence while maintaining high acceptance rates and output quality.
Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces that the generated distribution matches that of the verifier LLM. This is unnecessarily restrictive, as slight variations of the verifier's distribution, such as sampling with top-k or temperature, would also be acceptable. Typical acceptance sampling (TAS) alleviates this issue by accepting more tokens using entropy-based heuristics. However, this approach distorts the verifier distribution, potentially degrading output quality when the verifier encodes critical information. In this work, we formalize the speculative sampling algorithm through the lens of constrained optimization. Based on this formulation, we propose Cactus (constrained acceptance speculative sampling), a method that guarantees controlled divergence from the verifier distribution while increasing acceptance rates. Empirical results across a wide range of benchmarks confirm the effectiveness of our approach.
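For context, the standard SpS verification step the abstract refers to can be sketched as follows. This is a minimal illustration of the classic accept/reject rule (accept a draft token x ~ q with probability min(1, p(x)/q(x)), else resample from the renormalized residual max(p - q, 0)), not the paper's Cactus method; the function name and list-based distributions are illustrative assumptions.

```python
import random

def sps_accept(p, q, token):
    """Standard speculative sampling verification for one draft token.

    p: verifier distribution over the vocabulary (list of probabilities)
    q: draft-model distribution over the vocabulary
    token: index of the token sampled from q

    Returns the index of the token to emit. Accepts the draft token with
    probability min(1, p[token] / q[token]); on rejection, samples from
    the residual distribution max(p - q, 0), renormalized.
    """
    if random.random() < min(1.0, p[token] / q[token]):
        return token
    # Rejected: sample from the normalized residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    r, acc = random.random() * z, 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i
    return len(p) - 1  # numerical-safety fallback
```

This rule guarantees the emitted tokens are distributed exactly according to p; Cactus relaxes exactly this constraint, allowing a bounded divergence from p in exchange for a higher acceptance rate.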
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification (2026)
- DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification (2026)
- LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding (2026)
- ConFu: Contemplate the Future for Better Speculative Sampling (2026)
- MoE-Spec: Expert Budgeting for Efficient Speculative Decoding (2026)
- KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem (2026)
- SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding (2026)
