Abstract
Speculative speculative decoding (SSD) accelerates autoregressive decoding by parallelizing speculation and verification operations through preemptive prediction of verification outcomes.
Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them in parallel with a single target model forward pass. However, speculative decoding itself relies on a sequential dependency between speculation and verification. We introduce speculative speculative decoding (SSD) to parallelize these operations. While a verification is ongoing, the draft model predicts likely verification outcomes and prepares speculations preemptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
Community
Here are the main results from "Speculative Speculative Decoding" (SSD), which introduces a framework to parallelize drafting and verification in LLM inference:
1. Core Performance Results
SSD achieves up to 2× speedup over optimized speculative decoding (SD) and up to 5× over autoregressive (AR) decoding (Figure 7, Table B.3).
| Model | AR (tok/s) | SD (tok/s) | SSD (tok/s) | Speedup vs SD | Speedup vs AR |
|---|---|---|---|---|---|
| Llama-3.1-70B/1B | 54.7 | 161.8 | 255.8 | 1.58× | 4.68× |
| Qwen-3-32B/0.6B | 88.8 | 136.8 | 203.8 | 1.49× | 2.29× |
Figure 7 (End-to-End Evaluation) shows that SSD strictly improves the throughput-latency Pareto frontier across batch sizes: it achieves lower latency without sacrificing throughput, at the cost of additional compute (the draft model runs on separate hardware, 1× H100 for drafting vs. 4× H100 for the target).
2. Key Innovation: Parallelizing Speculation and Verification
Figure 1 illustrates the architectural shift:
- Left (Standard SD): The verifier waits idly while the draft model speculates, creating a sequential dependency.
- Center (SSD): The draft model runs asynchronously on separate hardware (1× H100) and predicts likely verification outcomes preemptively, preparing a "speculation cache." When verification completes, if the outcome was predicted, tokens are returned immediately with zero drafting overhead.
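The control flow above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the "models" are stand-in callables over integer token IDs, and all names (`ssd_step`, `predict_likely_outcomes`, etc.) are assumptions chosen for readability.

```python
from concurrent.futures import ThreadPoolExecutor

def verify(context):
    """Stand-in target-model verification: returns the accepted outcome."""
    return context[-1] + 1  # pretend the verifier accepts the next integer

def predict_likely_outcomes(context, fanout=3):
    """Stand-in draft-model prediction of the top `fanout` outcomes."""
    return [context[-1] + k for k in range(1, fanout + 1)]

def speculate(context):
    """Stand-in drafting: propose a short continuation."""
    return [context[-1] + 1, context[-1] + 2]

def ssd_step(context):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(verify, context)  # verification in flight
        # Meanwhile, preemptively draft a continuation for each predicted
        # outcome: this is the speculation cache.
        cache = {o: speculate(context + [o])
                 for o in predict_likely_outcomes(context)}
        outcome = pending.result()  # actual verification outcome
    if outcome in cache:
        return outcome, cache[outcome], True   # cache hit: zero drafting wait
    return outcome, speculate(context + [outcome]), False  # cache miss
```

On a cache hit, the step returns a ready-made speculation immediately; on a miss, it falls back to sequential drafting (the fallback policy of Section 4.3).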
3. The Saguaro Algorithm: Three Key Optimizations
A. Saguaro Cache: Geometric Fan-Out Strategy (Section 4.1)
To predict which verification outcomes to prepare for, Saguaro uses a geometric fan-out strategy (Theorem 12) rather than uniform allocation.
- Figure 3 shows cache hit rates follow a power law: rejection rates scale as $1/F^r$, where $F$ is the fan-out.
- Figure 4 demonstrates that geometric fan-out (allocating more guesses to earlier positions in the sequence) improves cache hit rates by up to 90% and end-to-end speed, especially at higher temperatures where uniform strategies fail.
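A geometric allocation of a fixed guess budget across positions can be sketched as follows. The exact allocation rule is the paper's (Theorem 12); the decay-by-ratio form and the function name here are assumptions for illustration.

```python
def geometric_fanout(budget, positions, r=2):
    """Split `budget` guesses over `positions` draft positions, with the
    per-position count decaying geometrically by factor `r`, so earlier
    positions receive more guesses than later ones."""
    weights = [r ** -i for i in range(positions)]  # 1, 1/r, 1/r^2, ...
    total = sum(weights)
    # Proportional allocation, guaranteeing at least one guess per position.
    return [max(1, round(budget * w / total)) for w in weights]
```

For example, a budget of 14 guesses over 3 positions with ratio 2 yields the allocation [8, 4, 2], versus [5, 5, 4] under a uniform split.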
B. Saguaro Sampling: Balancing Acceptance vs. Cache Hits (Section 4.2)
There is a fundamental tension between:
- High acceptance rate (draft tokens match target distribution)
- High cache hit rate (predicting the bonus token correctly)
Saguaro sampling (Figure 5) introduces a tunable hyperparameter $C \in [0,1]$ that downweights the probabilities of cached tokens during drafting. This:
- Increases residual probability mass on cached tokens (making the bonus token easier to predict)
- Creates a trade-off curve where lower $C$ → higher cache hits but lower acceptance rate
- Theorem 15 proves cache hit rate increases monotonically as $C \to 0$
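The downweighting idea can be sketched as a rescale-and-renormalize over the draft distribution. This is an assumed simplification of the paper's sampling rule: cached tokens' probabilities are scaled by $C$ and the distribution is renormalized, so the draft proposes cached tokens less often and more residual mass (hence the bonus token) lands on them at verification.

```python
def downweight(probs, cached, C):
    """Scale cached tokens' probabilities by C in [0, 1], then renormalize.

    probs:  dict mapping token -> draft probability (sums to 1)
    cached: set of tokens already present in the speculation cache
    """
    scaled = {t: p * (C if t in cached else 1.0) for t, p in probs.items()}
    z = sum(scaled.values())
    return {t: p / z for t, p in scaled.items()}
```

With $C = 1$ the draft distribution is unchanged (standard SD); as $C \to 0$ the draft never proposes cached tokens, pushing the trade-off entirely toward cache hits, consistent with the monotonicity in Theorem 15.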
C. Saguaro Fallback: Optimal Cache Miss Handling (Section 4.3)
When the cache misses (verification outcome not predicted), the optimal strategy depends on batch size:
Figure 6 shows:
- Small batches: Use the slow, high-quality primary speculator as backup
- Large batches: Switch to a fast backup speculator (e.g., random tokens or n-grams) at critical batch size $b^*$ (Theorem 17)
- At batch size 16, scaling draft compute (more GPUs → larger fan-out) continues to improve speedups
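The batch-size-dependent fallback policy reduces to a threshold rule. The critical batch size $b^*$ comes from the paper's Theorem 17; here it is just a parameter, and the function name is illustrative.

```python
def choose_speculator(batch_size, b_star, primary, backup):
    """Below the critical batch size b*, use the slow, high-quality primary
    speculator; at or above it, switch to the cheap backup (e.g. n-grams)."""
    return primary if batch_size < b_star else backup
```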
4. Theoretical Guarantees
- Corollary 8: SSD is strictly faster than SD whenever cache hit rate $p_{\text{hit}} > 0$
- Corollary 9 (Speedup Sandwich): Bounds the speedup of SSD over SD as proportional to $(1 + T_{\text{SD}}) \cdot \frac{E_{\text{hit}}}{E_{\text{SD}}} \cdot p_{\text{hit}}$, where $T_{\text{SD}}$ is the drafting latency and $E_{\text{hit}}$, $E_{\text{SD}}$ are the expected numbers of tokens generated on a cache hit and under SD, respectively
5. Implementation Architecture
The system uses a custom PyTorch inference engine with:
- PagedAttention and continuous batching
- Custom sparse attention masks (Figure 8)