Lean Laguna — Laguna XS.2 + DFlash, lossless single-GPU speedup

Laguna XS.2 generates 2.76× faster on one GPU — 19.6 → 54.2 tok/s — with byte-identical greedy output (0 / 14 mismatches), then carries that into ~64% cheaper RL rollouts.

Speculative decoding with Poolside's DFlash speculator on Laguna XS.2, served in vLLM on a single GPU. The throughput win is directly measured; the output is provably lossless under greedy decoding (token-for-token identical to baseline) and distribution-preserving under sampling. Unlike lossy compression — pruning, low-bit quant — this changes nothing about what the model emits: it cuts the number of expensive forward passes, not the model.

Poolside Research Hackathon, Foundations track. Scored on Generalisability · Reproducibility · Technical contributions; this entry targets the "reduce cost & latency" goal, paired with a novel research idea.

Results

Same prompts, same max_tokens, temperature 0 (greedy), single H200, --tensor-parallel-size 1, vLLM 0.22.0. Only --speculative-config differs between the two servers.

Metric	Baseline	+ DFlash	Δ
tok/s — mixed-difficulty (N=14)	19.6	54.2	2.76× ↑
tok/s — trivial (N=20)	19.5	48.1	2.47× ↑
tok/s — mixed re-run, fresh GPU (N=14)	19.8	52.1	2.63× ↑
greedy parity	—	identical	0 mismatches (0/14, 0/20, 0/14) ✓

tok/s is the headline — directly measured wall-clock, larger on the harder set than the trivial one, byte-identical in both. We don't quote acceptance-length τ: vLLM's /metrics pinned it at the γ+1 ceiling on both runs (a metrics artifact), so the claim rests on measured speedup + parity. No TTFT row (the harness didn't isolate it); byte-identical greedy output ⇒ identical pass@1 by construction.
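As a rough illustration of how the two reported quantities (speedup and greedy parity) can be computed, here is a minimal sketch. The function names and the toy completions are hypothetical and do not reflect the schema of results/baseline.json or results/dflash.json; only the throughput figures come from the table above.

```python
from typing import List


def speedup(baseline_tps: float, candidate_tps: float) -> float:
    """Ratio of candidate to baseline decode throughput (tok/s)."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_tps / baseline_tps


def count_mismatches(baseline: List[str], candidate: List[str]) -> int:
    """Number of prompts whose completions differ byte-for-byte."""
    if len(baseline) != len(candidate):
        raise ValueError("both runs must cover the same prompts")
    return sum(
        a.encode("utf-8") != b.encode("utf-8")
        for a, b in zip(baseline, candidate)
    )


# Throughputs from the "trivial (N=20)" row of the table above.
trivial_speedup = speedup(19.5, 48.1)

# Toy completions standing in for real server output (hypothetical data).
base_out = ["def add(a, b):\n    return a + b", "print('hi')"]
spec_out = ["def add(a, b):\n    return a + b", "print('hi')"]
mismatches = count_mismatches(base_out, spec_out)

print(f"speedup {trivial_speedup:.2f}x, mismatches {mismatches}/{len(base_out)}")
```

Comparing encoded bytes rather than normalized strings keeps the check strict: any whitespace or Unicode difference counts as a mismatch.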

γ-sweep (throughput-optimal is also lossless)

Decode tok/s over the 14-prompt set, baseline 19.95 tok/s, byte-parity checked at every γ (results/gamma_sweep.json):

γ	3	5	7 (default)	9 (γ*)	11
speedup	2.24×	2.64×	2.59×	2.65×	2.43×
lossless	✓0/14	✓0/14	✓0/14	✓0/14	✓0/14

Speedup climbs steeply from γ=3, stays within a narrow band for γ=5 through 9 (2.59× to 2.65×), and drops at γ=11: the classic acceptance/overhead tradeoff. The peak is at γ*=9, and the default γ=7 is within ~2.4% of it. Every γ is byte-lossless, including the throughput-optimal one.
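The selection of γ* from the sweep can be sketched in a few lines. The dictionary below copies the rounded speedups from the table; it is not the contents of results/gamma_sweep.json, whose layout is not shown here. Because the inputs are rounded to two decimals, the computed gap for the default is only approximate (roughly 2%, in line with the figure quoted above).

```python
# Speedups copied from the sweep table (rounded to two decimals).
sweep = {3: 2.24, 5: 2.64, 7: 2.59, 9: 2.65, 11: 2.43}
default_gamma = 7

# Throughput-optimal draft length: the gamma with the largest speedup.
best_gamma = max(sweep, key=sweep.get)

# Relative shortfall of the default setting versus the peak.
gap = (sweep[best_gamma] - sweep[default_gamma]) / sweep[best_gamma]

print(f"peak at gamma={best_gamma}; default gamma={default_gamma} trails by {gap:.1%}")
```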

Cheaper RL rollouts — the generalisability + frontier story

Rollout generation is decode-bound, so the measured 2.76× is ≈2.76× fewer GPU-seconds per rollout — a ~64% lower cost per rollout at any fixed $/hr. Because greedy completions are byte-identical, the reward signal is unchanged by construction → cheaper rollouts at zero quality cost. The speedup is a decode-time property, so it drops into any verifiers/ART-style RL trainer whose rollout phase is OpenAI-compatible vLLM inference — add --speculative-config to the rollout server.
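The step from a 2.76× speedup to a ~64% cost reduction is simple arithmetic, sketched below under the assumption stated above that rollout time is dominated by decode. The token count and the $/hr rate in the usage lines are hypothetical placeholders, not measured values.

```python
def cost_reduction(speedup: float) -> float:
    """Fractional GPU-time saving when decode runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def cost_per_rollout(tokens: int, tok_per_s: float, usd_per_hour: float) -> float:
    """Dollar cost of generating `tokens` at a given decode throughput."""
    return tokens / tok_per_s / 3600.0 * usd_per_hour


print(f"{cost_reduction(2.76):.1%}")  # prints 63.8%

# Hypothetical rollout: 512 generated tokens on a GPU billed at $4.00/hr.
baseline_cost = cost_per_rollout(512, 19.6, 4.00)
dflash_cost = cost_per_rollout(512, 54.2, 4.00)
print(f"baseline ${baseline_cost:.5f} vs DFlash ${dflash_cost:.5f} per rollout")
```

The hourly rate cancels out of the ratio, which is why the saving holds at any fixed $/hr.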

The environment is executable and public. spec_rl runs end-to-end via prime eval run and hosted prime train, published as art87able/spec-rl (one prime env install). A 12-problem HumanEval slice scores mean dense reward 0.85 (results/spec_rl_eval.json); the self-served vLLM baseline reproduces 0.85 exactly (results/reward_invariance.json); a fresh non-HumanEval, Adaption-generated set scores 0.917 (results/spec_rl_adaption_eval.json) — answering "is it just HumanEval?".

We post-trained Laguna for real — not just evaluated it. A free hosted GRPO run (prime train, 20 steps, batch 64×8, lr 1e-6) with online eval on a disjoint held-out split (HumanEval 50–74, eval_base_model=true): held-out dense reward 0.90 (base) → 0.96 (post-trained), every checkpoint ≥ base (results/rl_after.json). The gain is modest and bounded by setup — the split is near-saturated and greedy MoE eval isn't bit-reproducible (results/determinism_check.json: 0.85 / 1.0 / 1.0), so +0.06 sits inside the noise band, but the trend is consistently positive. What matters: the environment trains the model, not just scores it, on data the policy never trained on. Cost: $0. A larger 40-step follow-up (train pool 0–99, disjoint held-out 100–163, results/rl_after_big.json) on a less-saturated split showed no net held-out gain (base 0.819 → final 0.817; peak 0.839 at step 5, then regression) — an honest null that bounds the effect on harder problems, reported as such.
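The "inside the noise band" judgment can be made explicit with a small check. This sketch assumes the noise band is taken as the range of the three repeated scores quoted from results/determinism_check.json, and that those repeats are comparable to the held-out eval; both are simplifying assumptions rather than the authors' stated procedure.

```python
# Repeated greedy evals of the same model, as quoted above.
repeat_scores = [0.85, 1.0, 1.0]

# Held-out dense reward before and after the 20-step GRPO run.
base_reward, post_reward = 0.90, 0.96

# Assumption: use the run-to-run range as a crude noise band.
noise_band = max(repeat_scores) - min(repeat_scores)
gain = post_reward - base_reward
inside_noise = gain <= noise_band

print(f"gain {gain:+.2f}, run-to-run spread {noise_band:.2f}, inside noise: {inside_noise}")
```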

The open problem: in RL the policy moves every batch, so a base-trained drafter drifts → acceptance decays → the speedup erodes across training. Within a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful as the policy moves (periodic distillation, hidden-state-conditioned drafters, amortizing re-sync). That's the novel-research-idea axis.
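To see why drafter drift matters, it helps to look at an idealized model: if each drafted token is accepted independently with probability α, the expected number of tokens committed per verification pass with γ drafts is (1 − α^(γ+1)) / (1 − α). The sketch below evaluates that expression for a few α values. The α values are hypothetical, the independence assumption is a simplification, and none of this is a measurement of DFlash.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected committed tokens per verify pass, assuming each drafted token
    is accepted independently with probability alpha (idealized model)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft accepted, plus the bonus token from the target.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates as the policy drifts away from the drafter.
for alpha in (0.9, 0.8, 0.7, 0.6):
    print(f"alpha={alpha:.1f}: {expected_tokens_per_step(alpha, 7):.2f} tokens/step")
```

In this model the tokens committed per pass fall monotonically as α drops, which is the erosion described above; real speedup also depends on draft and verification overhead, which the sketch ignores.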

Method

Target: poolside/Laguna-XS.2 — 33.4B-total / 3B-active MoE, single GPU, FP8 native, 128K context.
Draft: poolside/Laguna-XS.2-speculator.dflash — 0.6B block-diffusion-style draft model.
How: DFlash proposes γ=7 tokens; Laguna verifies all 7 in one forward pass and commits the longest matching prefix + one bonus token. Under greedy the target only commits its own argmax → output token-identical to baseline; under sampling, vLLM rejection sampling preserves the target distribution.
Regime: the win lands at low-batch / memory-bound decode (single-GPU, single-agent); it shrinks at high batch / compute-bound. The lossless claim holds regardless — it's a property of verification.
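The greedy verification rule described under "How" can be illustrated with a toy sketch. The target and draft below are hypothetical deterministic functions over integer tokens, not Laguna or DFlash; the draft here proposes tokens one at a time, whereas DFlash is described as block-diffusion-style; and the per-position target calls stand in for what a real server does in a single batched forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n:
        seq.append(target(seq))
    return seq[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], n: int, gamma: int = 7) -> List[Token]:
    """Greedy speculative decoding: commit the longest draft prefix that
    matches the target's argmax, plus one token chosen by the target."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n:
        # Draft phase: propose gamma tokens.
        proposal: List[Token] = []
        for _ in range(gamma):
            proposal.append(draft(seq + proposal))
        # Verify phase: compare each proposed token with the target's choice.
        accepted: List[Token] = []
        for tok in proposal:
            want = target(seq + accepted)
            if tok != want:
                accepted.append(want)  # target's own token replaces the miss
                break
            accepted.append(tok)
        else:
            accepted.append(target(seq + accepted))  # all matched: bonus token
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n]


def toy_target(ctx: List[Token]) -> Token:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[Token]) -> Token:
    # Agrees with the target except when the context length is a multiple of 3.
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


prompt = [1, 2, 3]
assert speculative_decode(toy_target, toy_draft, prompt, 20) == greedy_decode(toy_target, prompt, 20)
```

Every committed token is the target's own greedy choice for its prefix, so the output matches the baseline regardless of how often the draft is wrong; draft quality only affects how many tokens each verification pass commits.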

Composes with orthogonal optimizations

Speculative decoding attacks the number of forward passes; the other levers attack different bottlenecks, so they stack on one GPU rather than compete:

Lever	What it cuts	Stacks with DFlash?
KV-cache quant / sparsity (fp8 / VQ / learned head-split)	KV memory	Yes (measured): γ=9 + fp8 KV → 2.40× with ~2× less KV (results/kv_stack.json)
Weight quant (FP8 / INT4 / NVFP4)	weight memory	Yes — orthogonal (weight-quant of attention doesn't itself speed decode — that's the speculator's job)
Expert pruning / MoE→dense distill	active params	Yes — the speculator rides on whatever target is served
Edge formats (GGUF)	CPU/edge footprint	Orthogonal target-side change; speculation layers on top

The distinguishing property is losslessness: greedy DFlash output is byte-identical (0 mismatches at every γ), so the speedup can't change what the model emits — and therefore can't alter the security profile (CWEs) of generated code. Lossy levers can silently change outputs; for code that ships, a 2.76× speedup that provably can't is the safer default.

Reproduce

One self-contained command on Hugging Face Jobs (serves baseline → measures → re-serves with DFlash → measures → byte-parity):

hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py

The one flag that is the experiment (vLLM ≥ 0.21, VLLM_USE_DEEP_GEMM=0):

--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
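For scripting the A/B pair, the flag can be generated rather than hand-quoted. The sketch below builds the two server command lines with `vllm serve`; the helper name `serve_command` is hypothetical, and scripts/serve_vllm.py may launch the servers differently. The config values and `--tensor-parallel-size 1` are the ones stated above.

```python
import json
import shlex

# Values taken from the --speculative-config flag shown above.
spec_config = {
    "model": "poolside/Laguna-XS.2-speculator.dflash",
    "num_speculative_tokens": 7,
    "method": "dflash",
}


def serve_command(speculative: bool) -> str:
    """Return a shell-safe `vllm serve` command for baseline or DFlash."""
    args = ["vllm", "serve", "poolside/Laguna-XS.2", "--tensor-parallel-size", "1"]
    if speculative:
        # Compact JSON so the flag matches the one-line form in the README.
        args += ["--speculative-config", json.dumps(spec_config, separators=(",", ":"))]
    return shlex.join(args)


print(serve_command(speculative=False))
print(serve_command(speculative=True))
```

Using `shlex.join` handles the single-quoting of the JSON payload, so the two commands differ only in the one flag under test.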

Local two-server flow (scripts/serve_vllm.py + bench/measure.py + evals/humaneval_subset.py --parity) reproduces the table on any CUDA box. The results table is the diff of results/baseline.json and results/dflash.json plus the parity check.

License

Apache 2.0, inheriting poolside/Laguna-XS.2. Does not redistribute weights — see NOTICE.md.
