Abstract
Speculative speculative decoding (SSD) accelerates autoregressive decoding by parallelizing speculation and verification operations through preemptive prediction of verification outcomes.
Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying them in parallel with a single target model forward pass. However, speculative decoding itself relies on a sequential dependency between speculation and verification. We introduce speculative speculative decoding (SSD) to parallelize these operations. While a verification is ongoing, the draft model predicts likely verification outcomes and prepares speculations preemptively for them. If the actual verification outcome is then in the predicted set, a speculation can be returned immediately, eliminating drafting overhead entirely. We identify three key challenges presented by speculative speculative decoding, and suggest principled methods to solve each. The result is Saguaro, an optimized SSD algorithm. Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines.
Community
Here are the main results from "Speculative Speculative Decoding" (SSD), which introduces a framework to parallelize drafting and verification in LLM inference:
1. Core Performance Results
SSD achieves up to 2× speedup over optimized speculative decoding (SD) and up to 5× over autoregressive (AR) decoding (Figure 7, Table B.3).
| Model | AR (tok/s) | SD (tok/s) | SSD (tok/s) | Speedup vs SD | Speedup vs AR |
|---|---|---|---|---|---|
| Llama-3.1-70B/1B | 54.7 | 161.8 | 255.8 | 1.58× | 4.68× |
| Qwen-3-32B/0.6B | 88.8 | 136.8 | 203.8 | 1.49× | 2.29× |
Figure 7 (End-to-End Evaluation) shows that SSD strictly improves the throughput-latency Pareto frontier across batch sizes: it achieves lower latency without sacrificing throughput, at the cost of additional compute (the draft model runs on separate hardware, 1× H100 for drafting vs. 4× H100 for the target).
2. Key Innovation: Parallelizing Speculation and Verification
Figure 1 illustrates the architectural shift:
- Left (Standard SD): The verifier waits idly while the draft model speculates, creating a sequential dependency.
- Center (SSD): The draft model runs asynchronously on separate hardware (1× H100) and predicts likely verification outcomes preemptively, preparing a "speculation cache." When verification completes, if the outcome was predicted, tokens are returned immediately with zero drafting overhead.
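The control flow above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the "models" are stand-in callables over integer token IDs, and all names (`ssd_step`, `predict_likely_outcomes`, etc.) are assumptions chosen for readability.

```python
from concurrent.futures import ThreadPoolExecutor

def verify(context):
    """Stand-in target-model verification: returns the accepted outcome."""
    return context[-1] + 1  # pretend the verifier accepts the next integer

def predict_likely_outcomes(context, fanout=3):
    """Stand-in draft-model prediction of the top `fanout` outcomes."""
    return [context[-1] + k for k in range(1, fanout + 1)]

def speculate(context):
    """Stand-in drafting: propose a short continuation."""
    return [context[-1] + 1, context[-1] + 2]

def ssd_step(context):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(verify, context)  # verification in flight
        # Meanwhile, preemptively draft a continuation for each predicted
        # outcome: this is the speculation cache.
        cache = {o: speculate(context + [o])
                 for o in predict_likely_outcomes(context)}
        outcome = pending.result()  # actual verification outcome
    if outcome in cache:
        return outcome, cache[outcome], True   # cache hit: zero drafting wait
    return outcome, speculate(context + [outcome]), False  # cache miss
```

On a cache hit, the step returns a ready-made speculation immediately; on a miss, it falls back to sequential drafting (the fallback policy of Section 4.3).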
3. The Saguaro Algorithm: Three Key Optimizations
A. Saguaro Cache: Geometric Fan-Out Strategy (Section 4.1)
To predict which verification outcomes to prepare for, Saguaro uses a geometric fan-out strategy (Theorem 12) rather than uniform allocation.
- Figure 3 shows cache hit rates follow a power law: rejection rates scale as $1/F^r$, where $F$ is the fan-out.
- Figure 4 demonstrates that geometric fan-out (allocating more guesses to earlier positions in the sequence) improves cache hit rates by up to 90% and end-to-end speed, especially at higher temperatures where uniform strategies fail.
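A geometric allocation of a fixed guess budget across positions can be sketched as follows. The exact allocation rule is the paper's (Theorem 12); the decay-by-ratio form and the function name here are assumptions for illustration.

```python
def geometric_fanout(budget, positions, r=2):
    """Split `budget` guesses over `positions` draft positions, with the
    per-position count decaying geometrically by factor `r`, so earlier
    positions receive more guesses than later ones."""
    weights = [r ** -i for i in range(positions)]  # 1, 1/r, 1/r^2, ...
    total = sum(weights)
    # Proportional allocation, guaranteeing at least one guess per position.
    return [max(1, round(budget * w / total)) for w in weights]
```

For example, a budget of 14 guesses over 3 positions with ratio 2 yields the allocation [8, 4, 2], versus [5, 5, 4] under a uniform split.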
B. Saguaro Sampling: Balancing Acceptance vs. Cache Hits (Section 4.2)
There is a fundamental tension between:
- High acceptance rate (draft tokens match target distribution)
- High cache hit rate (predicting the bonus token correctly)
Saguaro sampling (Figure 5) introduces a tunable hyperparameter $C \in [0,1]$ that downweights the probabilities of cached tokens during drafting. This:
- Increases residual probability mass on cached tokens (making the bonus token easier to predict)
- Creates a trade-off curve where lower $C$ → higher cache hits but lower acceptance rate
- Theorem 15 proves cache hit rate increases monotonically as $C \to 0$
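The downweighting idea can be sketched as a rescale-and-renormalize over the draft distribution. This is an assumed simplification of the paper's sampling rule: cached tokens' probabilities are scaled by $C$ and the distribution is renormalized, so the draft proposes cached tokens less often and more residual mass (hence the bonus token) lands on them at verification.

```python
def downweight(probs, cached, C):
    """Scale cached tokens' probabilities by C in [0, 1], then renormalize.

    probs:  dict mapping token -> draft probability (sums to 1)
    cached: set of tokens already present in the speculation cache
    """
    scaled = {t: p * (C if t in cached else 1.0) for t, p in probs.items()}
    z = sum(scaled.values())
    return {t: p / z for t, p in scaled.items()}
```

With $C = 1$ the draft distribution is unchanged (standard SD); as $C \to 0$ the draft never proposes cached tokens, pushing the trade-off entirely toward cache hits, consistent with the monotonicity in Theorem 15.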
C. Saguaro Fallback: Optimal Cache Miss Handling (Section 4.3)
When the cache misses (verification outcome not predicted), the optimal strategy depends on batch size:
Figure 6 shows:
- Small batches: Use the slow, high-quality primary speculator as backup
- Large batches: Switch to a fast backup speculator (e.g., random tokens or n-grams) at critical batch size $b^*$ (Theorem 17)
- At batch size 16, scaling draft compute (more GPUs → larger fan-out) continues to improve speedups
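The batch-size-dependent fallback policy reduces to a threshold rule. The critical batch size $b^*$ comes from the paper's Theorem 17; here it is just a parameter, and the function name is illustrative.

```python
def choose_speculator(batch_size, b_star, primary, backup):
    """Below the critical batch size b*, use the slow, high-quality primary
    speculator; at or above it, switch to the cheap backup (e.g. n-grams)."""
    return primary if batch_size < b_star else backup
```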
4. Theoretical Guarantees
- Corollary 8: SSD is strictly faster than SD whenever cache hit rate $p_{\text{hit}} > 0$
- Corollary 9 (Speedup Sandwich): Bounds the speedup of SSD over SD as proportional to $(1 + T_{\text{SD}}) \cdot \frac{E_{\text{hit}}}{E_{\text{SD}}} \cdot p_{\text{hit}}$, where $T_{\text{SD}}$ is the drafting latency and $E_{\text{hit}}$, $E_{\text{SD}}$ are the expected numbers of tokens generated on a cache hit and under SD, respectively
5. Implementation Architecture
The system uses a custom PyTorch inference engine with:
- PagedAttention and continuous batching
- Custom sparse attention masks (Figure 8)