Lean Laguna: Laguna XS.2 + DFlash, lossless single-GPU speedup

Laguna XS.2 generates 2.76× faster on one GPU (19.6 → 54.2 tok/s) with byte-identical greedy output (0 / 14 mismatches), then carries that into ~64% cheaper RL rollouts.

Speculative decoding with Poolside's DFlash speculator on Laguna XS.2, served in vLLM on a single GPU. The throughput win is directly measured; the output is provably lossless under greedy decoding (token-for-token identical to baseline) and distribution-preserving under sampling. Unlike lossy compression (pruning, low-bit quant), this changes nothing about what the model emits: it cuts the number of expensive forward passes, not the model.

Poolside Research Hackathon, Foundations track. Scored on Generalisability · Reproducibility · Technical contributions; this sits on reduce cost & latency for a novel research idea.

Results

Same prompts, same max_tokens, temperature 0 (greedy), single H200, --tensor-parallel-size 1, vLLM 0.22.0. Only --speculative-config differs between the two servers.

| Metric | Baseline | + DFlash | Δ |
|---|---|---|---|
| tok/s, mixed-difficulty (N=14) | 19.6 | 54.2 | 2.76× ↑ |
| tok/s, trivial (N=20) | 19.5 | 48.1 | 2.47× ↑ |
| tok/s, mixed re-run, fresh GPU (N=14) | 19.8 | 52.1 | 2.63× ↑ |
| greedy parity | n/a | identical | 0 mismatches (0/14, 0/20, 0/14) ✓ |

tok/s is the headline: directly measured wall-clock, larger on the harder set than the trivial one, byte-identical in both. We don't quote acceptance-length τ: vLLM's /metrics pinned it at the γ+1 ceiling on both runs (a metrics artifact), so the claim rests on measured speedup + parity. No TTFT row (the harness didn't isolate it); byte-identical greedy output ⇒ identical pass@1 by construction.
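A minimal sketch of how a parity count such as "0 / 14 mismatches" can be computed, assuming each run is stored as a list of completion strings in the same prompt order. The function name and the sample completions are illustrative and are not taken from the repository's harness.

```python
def count_mismatches(baseline, speculative):
    """Count prompts whose completions differ at the byte level."""
    if len(baseline) != len(speculative):
        raise ValueError("both runs must cover the same prompts")
    return sum(
        b.encode("utf-8") != s.encode("utf-8")
        for b, s in zip(baseline, speculative)
    )


# Hypothetical completions, for illustration only.
base_out = ["def add(a, b):\n    return a + b\n", "print('hi')\n"]
spec_out = ["def add(a, b):\n    return a + b\n", "print('hi')\n"]

mismatches = count_mismatches(base_out, spec_out)
print(f"{mismatches} / {len(base_out)} mismatches")
```

Comparing encoded bytes rather than normalized text keeps the check strict: a single differing whitespace character counts as a mismatch.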

γ-sweep (throughput-optimal is also lossless)

Decode tok/s over the 14-prompt set, baseline 19.95 tok/s, byte-parity checked at every γ (results/gamma_sweep.json):

| γ | 3 | 5 | 7 (default) | 9 (γ*) | 11 |
|---|---|---|---|---|---|
| speedup | 2.24× | 2.64× | 2.59× | 2.65× | 2.43× |
| lossless | ✓ 0/14 | ✓ 0/14 | ✓ 0/14 | ✓ 0/14 | ✓ 0/14 |

Speedup broadly rises then falls (the classic acceptance/overhead tradeoff) and peaks at γ*=9. The default γ=7 is within ~2.4% of the optimum, and every γ is byte-lossless, so the throughput-optimal γ is also exactly lossless.

Cheaper RL rollouts: the generalisability + frontier story

Rollout generation is decode-bound, so the measured 2.76× is ≈2.76× fewer GPU-seconds per rollout, a ~64% lower cost per rollout at any fixed $/hr. Because greedy completions are byte-identical, the reward signal is unchanged by construction → cheaper rollouts at zero quality cost. The speedup is a decode-time property, so it drops into any verifiers/ART-style RL trainer whose rollout phase is OpenAI-compatible vLLM inference: add --speculative-config to the rollout server.
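The step from a 2.76× speedup to a ~64% cost reduction is simple arithmetic, sketched here under the assumption stated above that a rollout is entirely decode-bound. The function name is illustrative; any non-decode share of the rollout would shrink the saving.

```python
def rollout_cost_reduction(speedup):
    """Fraction of GPU-seconds saved when decode runs `speedup` times faster.

    Assumes the whole rollout is decode time, so cost scales as 1 / speedup.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


saving = rollout_cost_reduction(2.76)
print(f"{saving:.1%} fewer GPU-seconds per rollout")
```

Because the saving is a ratio of GPU-seconds, it is independent of the hourly GPU price, which is why the figure holds at any fixed $/hr.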

The environment is executable and public. spec_rl runs end-to-end via prime eval run and hosted prime train, published as art87able/spec-rl (one prime env install). A 12-problem HumanEval slice scores mean dense reward 0.85 (results/spec_rl_eval.json); the self-served vLLM baseline reproduces 0.85 exactly (results/reward_invariance.json); a fresh non-HumanEval, Adaption-generated set scores 0.917 (results/spec_rl_adaption_eval.json), answering "is it just HumanEval?".

We post-trained Laguna for real, not just evaluated it. A free hosted GRPO run (prime train, 20 steps, batch 64×8, lr 1e-6) with online eval on a disjoint held-out split (HumanEval 50–74, eval_base_model=true): held-out dense reward 0.90 (base) → 0.96 (post-trained), every checkpoint ≥ base (results/rl_after.json). The gain is modest and bounded by setup: the split is near-saturated and greedy MoE eval isn't bit-reproducible (results/determinism_check.json: 0.85 / 1.0 / 1.0), so +0.06 sits inside the noise band, but the trend is consistently positive. What matters: the environment trains the model, not just scores it, and the evaluation uses data the policy never trained on. Cost: $0. A larger 40-step follow-up (train pool 0–99, disjoint held-out 100–163, results/rl_after_big.json) on a less-saturated split showed no net held-out gain (base 0.819 → final 0.817; peak 0.839 at step 5, then regression). This is an honest null that bounds the effect on harder problems, reported as such.

The open problem: in RL the policy moves every batch, so a base-trained drafter drifts → acceptance decays → the speedup erodes across training. Within a batch the policy is frozen, so the per-batch win is real; the frontier is keeping the drafter useful as the policy moves (periodic distillation, hidden-state-conditioned drafters, amortizing re-sync). That's the novel-research-idea axis.

Method

  • Target: poolside/Laguna-XS.2, a 33.4B-total / 3B-active MoE, single GPU, FP8 native, 128K context.
  • Draft: poolside/Laguna-XS.2-speculator.dflash, a 0.6B block-diffusion-style draft model.
  • How: DFlash proposes γ=7 tokens; Laguna verifies all 7 in one forward pass and commits the longest matching prefix + one bonus token. Under greedy the target only commits its own argmax → output token-identical to baseline; under sampling, vLLM rejection sampling preserves the target distribution.
  • Regime: the win lands at low-batch / memory-bound decode (single-GPU, single-agent); it shrinks at high batch / compute-bound. The lossless claim holds regardless, since it's a property of verification.
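The propose-and-verify loop described in the list above can be illustrated with a toy model. This is a sketch under loud assumptions, not the DFlash or vLLM implementation: `target_next` and `draft_next` are hypothetical stand-ins for the two models, the draft here proposes tokens one at a time rather than as a diffusion-style block, and the single target forward pass is simulated by scoring each proposed position in turn.

```python
def target_next(prefix):
    """Toy deterministic target: greedy next token as a function of the prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix):
    """Toy draft: agrees with the target except when len(prefix) % 4 == 0."""
    guess = target_next(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_baseline(prompt, max_new):
    """Plain greedy decoding: one target pass per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out[len(prompt):]


def greedy_speculative(prompt, max_new, gamma=7):
    """Greedy speculative decoding: commit only tokens the target would emit."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # Draft proposes gamma tokens on top of the current output.
        proposal = []
        for _ in range(gamma):
            proposal.append(draft_next(out + proposal))
        # One simulated target pass checks every proposed position.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # target's own token replaces the miss
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the pass also yields a bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new], target_passes


prompt = [1, 2, 3]
baseline = greedy_baseline(prompt, 20)
spec, passes = greedy_speculative(prompt, 20)
assert spec == baseline
print(f"identical output, {passes} target passes instead of 20")
```

Every committed token is the target's own greedy choice for its position, so the output cannot differ from baseline no matter how poor the draft is; draft quality only changes how many target passes are needed.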

Composes with orthogonal optimizations

Speculative decoding attacks the number of forward passes; the other levers attack different bottlenecks, so they stack on one GPU rather than compete:

| Lever | What it cuts | Stacks with DFlash? |
|---|---|---|
| KV-cache quant / sparsity (fp8 / VQ / learned head-split) | KV memory | Yes, measured: γ=9 + fp8 KV → 2.40× with ~2× less KV (results/kv_stack.json) |
| Weight quant (FP8 / INT4 / NVFP4) | weight memory | Yes, orthogonal (weight-quant of attention doesn't itself speed decode; that's the speculator's job) |
| Expert pruning / MoE→dense distill | active params | Yes: the speculator rides on whatever target is served |
| Edge formats (GGUF) | CPU/edge footprint | Orthogonal target-side change; speculation layers on top |

The distinguishing property is losslessness: greedy DFlash output is byte-identical (0 mismatches at every γ), so the speedup can't change what the model emits, and therefore can't alter the security profile (CWEs) of generated code. Lossy levers can silently change outputs; for code that ships, a 2.76× speedup that provably can't is the safer default.

Reproduce

One self-contained command on Hugging Face Jobs (serves baseline → measures → re-serves with DFlash → measures → byte-parity):

hf jobs uv run --flavor h200 --timeout 1500 --detach --secrets HF_TOKEN scripts/hf_job_ab.py

The one flag that is the experiment (vLLM ≥ 0.21, VLLM_USE_DEEP_GEMM=0):

--speculative-config '{"model":"poolside/Laguna-XS.2-speculator.dflash","num_speculative_tokens":7,"method":"dflash"}'
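To make the "only one flag differs" setup concrete, the sketch below builds the two server command lines from Python and shows that they differ only by --speculative-config. It assumes the servers are launched with the `vllm serve` entry point; the repository's own scripts/serve_vllm.py may do this differently, and `serve_argv` is a hypothetical helper.

```python
import json
import shlex

# Values copied from the flag above.
speculative_config = {
    "model": "poolside/Laguna-XS.2-speculator.dflash",
    "num_speculative_tokens": 7,
    "method": "dflash",
}


def serve_argv(target="poolside/Laguna-XS.2", spec=None):
    """Build a `vllm serve` command line (assumed launcher), optionally with DFlash."""
    argv = ["vllm", "serve", target, "--tensor-parallel-size", "1"]
    if spec is not None:
        # Compact JSON so the flag value is a single shell argument.
        argv += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    return argv


baseline_cmd = serve_argv()
dflash_cmd = serve_argv(spec=speculative_config)

# The environment variable is the one named in the text above.
print("VLLM_USE_DEEP_GEMM=0 " + shlex.join(baseline_cmd))
print("VLLM_USE_DEEP_GEMM=0 " + shlex.join(dflash_cmd))
```

Generating the JSON with `json.dumps` instead of hand-writing it avoids quoting mistakes, and `shlex.join` produces a string that can be pasted into a POSIX shell.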

Local two-server flow (scripts/serve_vllm.py + bench/measure.py + evals/humaneval_subset.py --parity) reproduces the table on any CUDA box. The results table is the diff of results/baseline.json and results/dflash.json plus the parity check.

License

Apache 2.0, inheriting poolside/Laguna-XS.2. Does not redistribute weights; see NOTICE.md.
