Lean Laguna: Laguna XS.2 + DFlash, lossless single-GPU speedup
Laguna XS.2 generates 2.76× faster on one GPU (19.6 → 54.2 tok/s) with byte-identical greedy output (0 / 14 mismatches), then carries that into ~64% cheaper RL rollouts.
Speculative decoding with Poolside's DFlash speculator on Laguna XS.2, served in vLLM on a single GPU. The throughput win is directly measured; the output is provably lossless under greedy decoding (token-for-token identical to baseline) and distribution-preserving under sampling. Unlike lossy compression (pruning, low-bit quant), this changes nothing about what the model emits: it cuts the number of expensive forward passes, not the model.
Poolside Research Hackathon, Foundations track. Scored on Generalisability · Reproducibility · Technical contributions; this sits on reduce cost & latency for a novel research idea.
Results
Same prompts, same max_tokens, temperature 0 (greedy), single H200, --tensor-parallel-size 1,
vLLM 0.22.0. Only --speculative-config differs between the two servers.
| Metric | Baseline | + DFlash | Δ |
|---|---|---|---|
| tok/s, mixed-difficulty (N=14) | 19.6 | 54.2 | 2.76× ✓ |
| tok/s, trivial (N=20) | 19.5 | 48.1 | 2.47× ✓ |
| tok/s, mixed re-run, fresh GPU (N=14) | 19.8 | 52.1 | 2.63× ✓ |
| greedy parity | n/a | identical | 0 mismatches (0/14, 0/20, 0/14) ✓ |
tok/s is the headline: directly measured wall-clock, larger on the harder set than the trivial
one, byte-identical in both. We don't quote acceptance-length τ: vLLM's /metrics pinned it at the
γ+1 ceiling on both runs (a metrics artifact), so the claim rests on measured speedup + parity. No TTFT
row (the harness didn't isolate it); byte-identical greedy output ⇒ identical pass@1 by construction.
γ-sweep (throughput-optimal is also lossless)
Decode tok/s over the 14-prompt set, baseline 19.95 tok/s, byte-parity checked at every γ
(results/gamma_sweep.json):
| γ | 3 | 5 | 7 (default) | 9 (γ*) | 11 |
|---|---|---|---|---|---|
| speedup | 2.24× | 2.64× | 2.59× | 2.65× | 2.43× |
| lossless | ✓ 0/14 | ✓ 0/14 | ✓ 0/14 | ✓ 0/14 | ✓ 0/14 |
Speedup rises from γ=3, stays roughly flat across γ=5 to 9, then falls at γ=11 (the classic acceptance/overhead tradeoff). It peaks at γ*=9, and the default γ=7 is within ~2.4% of the optimum. Every γ is byte-lossless, so the throughput-optimal γ is exactly lossless too.
Cheaper RL rollouts: the generalisability + frontier story
Rollout generation is decode-bound, so the measured 2.76× is ≈2.76× fewer GPU-seconds per rollout,
a ~64% lower cost per rollout at any fixed $/hr. Because greedy completions are byte-identical, the
reward signal is unchanged by construction: cheaper rollouts at zero quality cost. The speedup is a
decode-time property, so it drops into any verifiers/ART-style RL trainer whose rollout phase is
OpenAI-compatible vLLM inference: add --speculative-config to the rollout server.
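The ~64% figure follows directly from the speedup. A minimal sketch of that arithmetic, assuming the rollout is entirely decode-bound (so GPU-seconds scale as 1/speedup) and the hourly price is fixed; the function name is illustrative and not part of this repo:

```python
def rollout_cost_reduction(speedup: float) -> float:
    """Fraction of per-rollout cost saved at a fixed $/hr.

    Assumes GPU-seconds per rollout scale as 1 / speedup, i.e. the
    rollout is fully decode-bound (an assumption, not a measurement).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# Measured headline speedup from the results table.
saving = rollout_cost_reduction(2.76)
print(f"cost reduction per rollout: {saving:.1%}")
```

Any prefill or environment time that speculation does not accelerate would lower the realised saving, so this is best read as an upper bound for a decode-dominated rollout.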
The environment is executable and public. spec_rl runs end-to-end via prime eval run and hosted
prime train, published as art87able/spec-rl
(one prime env install). A 12-problem HumanEval slice scores mean dense reward 0.85
(results/spec_rl_eval.json); the self-served vLLM baseline reproduces 0.85 exactly
(results/reward_invariance.json); a fresh non-HumanEval, Adaption-generated set scores 0.917
(results/spec_rl_adaption_eval.json), answering "is it just HumanEval?".
We post-trained Laguna for real, not just evaluated it. A free hosted GRPO run (prime train,
20 steps, batch 64×8, lr 1e-6) with online eval on a disjoint held-out split (HumanEval 50–74,
eval_base_model=true): held-out dense reward 0.90 (base) → 0.96 (post-trained), every checkpoint ≥
base (results/rl_after.json). The gain is modest and bounded by setup: the split is near-saturated
and greedy MoE eval isn't bit-reproducible (results/determinism_check.json: 0.85 / 1.0 / 1.0), so +0.06
sits inside the noise band, but the trend is consistently positive. What matters: the environment
trains the model, not just scores it, on data the policy never trained on. Cost: $0. A larger 40-step follow-up
(train pool 0–99, disjoint held-out 100–163, results/rl_after_big.json) on a less-saturated split
showed no net held-out gain (base 0.819 → final 0.817; peak 0.839 at step 5, then regression), an
honest null that bounds the effect on harder problems, reported as such.
The open problem: in RL the policy moves every batch, so a base-trained drafter drifts → acceptance decays → the speedup erodes across training. Within a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful as the policy moves (periodic distillation, hidden-state-conditioned drafters, amortizing re-sync). That's the novel-research-idea axis.
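To make the erosion concrete, here is a rough sketch using the standard simplified model of speculative decoding, in which each drafted token is accepted independently with probability α. Under that assumption the expected number of tokens committed per target forward pass is (1 − α^(γ+1)) / (1 − α). The α values below are hypothetical, not measurements from this project, and real acceptance is not i.i.d.:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens committed per target forward pass.

    Simplified model: each of the gamma drafted tokens is accepted
    independently with probability alpha, and one extra token (bonus or
    correction) is always committed by the target.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates as the policy drifts away from the drafter.
for alpha in (0.9, 0.8, 0.7, 0.6, 0.5):
    tokens = expected_tokens_per_step(alpha, gamma=7)
    print(f"alpha={alpha:.1f}  tokens per target pass ~ {tokens:.2f}")
```

The quantity falls monotonically as α drops, which is the mechanism behind the speedup eroding across training; it never falls below one token per pass, which is why output stays correct even with a badly drifted drafter.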
Method
- Target: poolside/Laguna-XS.2, 33.4B-total / 3B-active MoE, single GPU, FP8 native, 128K context.
- Draft: poolside/Laguna-XS.2-speculator.dflash, 0.6B block-diffusion-style draft model.
- How: DFlash proposes γ=7 tokens; Laguna verifies all 7 in one forward pass and commits the longest matching prefix + one bonus token. Under greedy the target only commits its own argmax ⇒ output token-identical to baseline; under sampling, vLLM rejection sampling preserves the target distribution.
- Regime: the win lands at low-batch / memory-bound decode (single-GPU, single-agent); it shrinks at high batch / compute-bound. The lossless claim holds regardless: it's a property of verification.
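The greedy losslessness argument can be illustrated with a toy sketch. This is not vLLM's or DFlash's implementation: the target and draft "models" are hypothetical deterministic functions, the draft proposes tokens one at a time rather than as a block, and verification calls the target per position where a real server scores all positions in one batched forward pass. What it does show is that every committed token is the target's own argmax, so the output matches plain greedy decoding regardless of draft quality:

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain greedy decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_greedy_decode(target: NextToken, draft: NextToken,
                              prompt: List[int], max_new: int,
                              gamma: int = 7) -> List[int]:
    """Greedy speculative decoding: draft proposes, target verifies."""
    out = list(prompt)
    produced = 0
    while produced < max_new:
        # 1) Draft proposes gamma tokens.
        proposal, ctx = [], list(out)
        for _ in range(gamma):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) Target verifies: keep the longest prefix matching its argmax.
        accepted, ctx = [], list(out)
        for tok in proposal:
            want = target(ctx)
            if want != tok:
                accepted.append(want)  # correction: the target's own token
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target(ctx))  # bonus token: all drafts accepted
        accepted = accepted[: max_new - produced]
        out.extend(accepted)
        produced += len(accepted)
    return out[len(prompt):]


def toy_target(ctx: List[int]) -> int:
    """Hypothetical deterministic 'target model' over a 50-token vocabulary."""
    return (31 * sum(ctx) + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Hypothetical drafter: agrees with the target except at every 4th position."""
    tok = toy_target(ctx)
    return tok if len(ctx) % 4 else (tok + 1) % 50


prompt = [3, 1, 4]
baseline = greedy_decode(toy_target, prompt, max_new=40)
speculative = speculative_greedy_decode(toy_target, toy_draft, prompt, max_new=40)
print("identical:", baseline == speculative)
```

Swapping `toy_draft` for a drafter that is always wrong changes only how many tokens each step commits, never which tokens are emitted.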
Composes with orthogonal optimizations
Speculative decoding attacks the number of forward passes; the other levers attack different bottlenecks, so they stack on one GPU rather than compete:
| Lever | What it cuts | Stacks with DFlash? |
|---|---|---|
| KV-cache quant / sparsity (fp8 / VQ / learned head-split) | KV memory | Yes, measured: γ=9 + fp8 KV → 2.40× with ~2× less KV (results/kv_stack.json) |
| Weight quant (FP8 / INT4 / NVFP4) | weight memory | Yes, orthogonal (weight-quant of attention doesn't itself speed decode; that's the speculator's job) |
| Expert pruning / MoE→dense distill | active params | Yes, the speculator rides on whatever target is served |
| Edge formats (GGUF) | CPU/edge footprint | Orthogonal target-side change; speculation layers on top |
The distinguishing property is losslessness: greedy DFlash output is byte-identical (0 mismatches at every γ), so the speedup can't change what the model emits, and therefore can't alter the security profile (CWEs) of generated code. Lossy levers can silently change outputs; for code that ships, a 2.76× speedup that provably can't is the safer default.
Reproduce
One self-contained command on Hugging Face Jobs (serves baseline → measures → re-serves with DFlash → measures → byte-parity):
hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py
The one flag that is the experiment (vLLM ≥ 0.21, VLLM_USE_DEEP_GEMM=0):
--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
Local two-server flow (scripts/serve_vllm.py + bench/measure.py + evals/humaneval_subset.py --parity)
reproduces the table on any CUDA box. The results table is the diff of results/baseline.json and
results/dflash.json plus the parity check.
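As a rough illustration of what "diff plus parity check" means, the sketch below compares two runs held in memory. The record layout (a list of completion strings and a tok/s number per run) is an assumption for illustration only; it is not the actual schema of results/baseline.json or results/dflash.json, and the completions shown are placeholders:

```python
from typing import List


def parity_mismatches(baseline: List[str], candidate: List[str]) -> int:
    """Count prompts whose completions differ at the byte level."""
    if len(baseline) != len(candidate):
        raise ValueError("runs must cover the same prompts in the same order")
    return sum(
        a.encode("utf-8") != b.encode("utf-8")
        for a, b in zip(baseline, candidate)
    )


def speedup(baseline_tok_s: float, candidate_tok_s: float) -> float:
    """Throughput ratio of the speculative server over the baseline server."""
    if baseline_tok_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tok_s / baseline_tok_s


# Placeholder completions standing in for the two servers' greedy outputs.
base_out = ["def add(a, b):\n    return a + b\n", "print('hi')\n"]
spec_out = ["def add(a, b):\n    return a + b\n", "print('hi')\n"]

print("mismatches:", parity_mismatches(base_out, spec_out), "/", len(base_out))
# Throughputs taken from the trivial-set row of the results table.
print(f"speedup: {speedup(19.5, 48.1):.2f}x")
```

Comparing encoded bytes rather than normalised text is deliberate: whitespace or Unicode normalisation before comparison could hide a real divergence.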
License
Apache 2.0, inheriting poolside/Laguna-XS.2. Does not redistribute weights; see NOTICE.md.