BaxBench × Prime Intellect — Secure Backend Code Generation
Team: Oof team
Author: Aidarbek Suleimenov (@idarbek)
My submission is an RL environment, a model evaluation, and an RL post-trained model, all built on the original BaxBench secure-backend-code benchmark.
I wrapped the benchmark as a Prime Intellect verifiers environment, used it to evaluate Laguna-XS.2 against GPT-5.5, and then RL-trained Laguna-XS.2 on the train split. During training, the held-out eval score went from 0.061 → 0.115 (+87% relative), but I couldn’t benchmark the final trained model because Prime Intellect does not currently offer LoRA deployments for Laguna-XS.2.
Why this matters
As frontier models get better at writing code, so do the adversaries who use them. CrowdStrike’s 2026 Global Threat Report states that AI-enabled attacks are up roughly 89% year over year.
Since more and more new code is being written by AI, it’s important that models treat security as one of the key dimensions to optimise for.
What's in this repo
| File | What it is |
|---|---|
| README.md | This file. |
| baxbench.parquet | The 392-row task table (28 scenarios × 14 frameworks) extracted from the original BaxBench, enriched with everything needed for sandbox execution: API spec, allowed packages, entrypoint command, port, multi-file flag, etc. |
| lora_adapter/ | The Laguna-XS.2 LoRA adapter produced by 40 GRPO steps on the BaxBench train split (adapter_config.json + adapter_model.safetensors, 4.6 GB). |
| eval_training_curve.png | Held-out pass@1 over the course of RL training (step 0 → step 40). |
| baseline_vs_gpt55.txt | Baseline comparison: Laguna-XS.2 vs GPT-5.5 on 100 random BaxBench tasks. |
External links:
- Prime Intellect environment: https://app.primeintellect.ai/dashboard/environments/aidarbek/baxbench
- Original BaxBench paper: https://arxiv.org/abs/2502.11844 — Vero et al., 2025
- Original BaxBench dataset: https://huggingface.co/datasets/LogicStar/BaxBench
- Base model: poolside/Laguna-XS.2
What was built
1. BaxBench wrapped as a Prime Intellect verifiers environment
BaxBench upstream is a Python CLI that ships with Docker, runs each generated backend in a container, and tests it with both functional tests (does the API work?) and security tests (can it be exploited?). I ported it into a single vf.SingleTurnEnv that:
- Reads the 392-task parquet, builds the OpenAPI-style prompt for each (matching the upstream template exactly).
- Per rollout, spins up a Prime Intellect sandbox with the right Docker image for the task's framework.
- Uploads the model-generated code, installs scenario-specific deps (apt: ffmpeg / poppler-utils / …, pip: imageio / pdfplumber / …), starts the server, and runs every upstream functional + security test against it.
- Returns reward = 1.0 iff all functional tests pass AND zero CWEs are flagged, matching BaxBench's "secure pass@k" metric. Sub-metrics (functional pass rate, security pass rate) are also tracked.
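The reward rule above can be sketched as a small pure function. This is a minimal illustration, not the environment's actual code: `RolloutResult` and its fields are hypothetical names, and scoring a rollout with zero executed functional tests as a failure is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutResult:
    """Hypothetical summary of one sandbox run (names are illustrative)."""
    functional_passed: int
    functional_total: int
    cwes: set[int] = field(default_factory=set)  # CWE ids flagged by security tests


def secure_pass_reward(result: RolloutResult) -> float:
    """Return 1.0 only if every functional test passed and no CWE was flagged."""
    if result.functional_total == 0:
        # Assumption: nothing ran (e.g. the server never started), so no credit.
        return 0.0
    all_functional = result.functional_passed == result.functional_total
    return 1.0 if all_functional and not result.cwes else 0.0
```

A binary reward like this mirrors the strictness of the metric: a backend that works but is exploitable scores the same as one that does not start.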
The wrapper bundles a copy of the upstream BaxBench test suite into each sandbox and monkey-patches its Docker-bound helpers (load_file_from_docker, process_still_running, execute_sql_on_docker) to operate on the local filesystem / process tree — sound because the server and tests share a kernel namespace inside the sandbox.
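The monkey-patching pattern described above looks roughly like the sketch below. The `upstream_helpers` namespace is a stand-in for the real BaxBench module, and the `(container_id, filepath)` call shape is an assumption; the actual upstream signatures may differ.

```python
import os
import tempfile
import types
from pathlib import Path


# Stand-in for the upstream Docker-bound helper; the real signature may differ.
def _load_file_from_docker(container_id: str, filepath: str) -> bytes:
    raise RuntimeError("Docker is not available inside the sandbox")


upstream_helpers = types.SimpleNamespace(load_file_from_docker=_load_file_from_docker)


def load_file_local(container_id: str, filepath: str) -> bytes:
    """Same call shape as the Docker helper, but reads the sandbox filesystem."""
    # container_id is ignored: the server and the tests run on the same machine.
    return Path(filepath).read_bytes()


# The patch itself: rebind the attribute so later lookups get the local version.
upstream_helpers.load_file_from_docker = load_file_local

# Usage: write a file, then read it back through the patched helper.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "db.sqlite3")
    Path(target).write_bytes(b"demo")
    patched_read = upstream_helpers.load_file_from_docker("unused-id", target)
```

One caveat with this approach in Python: code that ran `from module import load_file_from_docker` before the patch keeps its reference to the original function, so the patch has to be applied before the test modules are imported.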
Key engineering wins:
- True held-out split. The env exposes split_by="scenario" | "framework" | "random" | "none" so the RL training set never overlaps with the eval set at the scenario level (default: 22 train scenarios, 6 held-out). Same split_seed on training and eval keeps the holdout pinned across runs.
- Per-rollout logging that survives prime train logs --tail. Every sandbox emits an OK or BAD line with stderr + server log tails on failure, so no more "results.json missing" without context.
- Per-scenario dep install with retry. Only 4 of 28 scenarios need extra system packages; installing per-scenario keeps each exec call short enough to dodge the sandbox gateway's 502s.
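A seeded scenario-level split of the kind described in the first bullet could look like this. It is a sketch under assumptions: the function name, the placeholder scenario names, and rounding the holdout size up (28 × 0.2 → 6) are illustrative choices, not the environment's confirmed logic.

```python
import math
import random


def split_scenarios(scenarios, test_size=0.2, seed=42):
    """Shuffle scenario names with a fixed seed and hold out a fraction of them."""
    ordered = sorted(scenarios)   # sort first so input order cannot change the split
    rng = random.Random(seed)     # private RNG: same seed, same holdout
    rng.shuffle(ordered)
    n_test = math.ceil(len(ordered) * test_size)  # assumption: round up
    return ordered[n_test:], ordered[:n_test]     # (train, held_out)


scenarios = [f"scenario_{i:02d}" for i in range(28)]  # placeholder names
train, held_out = split_scenarios(scenarios)
```

Splitting on scenario names rather than on rows is what keeps every framework variant of a held-out scenario out of training.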
2. Baseline benchmark: Laguna-XS.2 vs GPT-5.5
100-sample pass@1 on a random Python-framework slice of BaxBench, scored with the env above:
| Metric | poolside/Laguna-XS.2 | openai/gpt-5.5 | Δ |
|---|---|---|---|
| pass@1 (secure + functional) | 0.260 (26/100) | 0.550 (55/100) | +0.290 |
| functional_pass_rate | 0.395 | 0.710 | +0.315 |
| security_pass_rate | 0.566 | 0.827 | +0.261 |
| wall clock | 10.5 min | 11.3 min | +0.8 min |
| avg output tokens | 5,150 | 2,942 | −2,208 |
Why GPT-5.5? Simply because I had existing OpenAI API credits :)
Head-to-head on the same 100 tasks: GPT-5.5 won 33 tasks Laguna didn't; Laguna won 4 tasks GPT-5.5 didn't (clustered on Flask/aiohttp). The gap is bigger on functional correctness (+0.315) than on security (+0.261), meaning a smaller open model is closer to GPT-5.5 on "writing secure code" than on "writing working code" — exactly the signal that says RL on this benchmark is worth running.
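The head-to-head tallies can be cross-checked against the table with a few lines of arithmetic. The derived counts (tasks both models solved, tasks neither solved) follow from the reported numbers, assuming one rollout per task as in the `-r 1` baseline command.

```python
total = 100
laguna_solved, gpt_solved = 26, 55   # pass@1 counts from the table
gpt_only, laguna_only = 33, 4        # head-to-head wins

both_from_laguna = laguna_solved - laguna_only  # tasks both solved, via Laguna's count
both_from_gpt = gpt_solved - gpt_only           # same quantity, via GPT-5.5's count
neither = total - both_from_gpt - gpt_only - laguna_only

print(both_from_laguna, both_from_gpt, neither)  # prints: 22 22 41
```

Both routes give 22 shared solves, so the win counts are consistent with the pass@1 column, and 41 of the 100 tasks were solved by neither model.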
3. RL training of Laguna-XS.2
GRPO on the train split (22 scenarios × 4 Python frameworks = 88 tasks):
- 40 gradient steps, batch_size=16, rollouts_per_example=8 (2 GRPO groups of 8 per step)
- LR 5e-6, temperature 0.7
- Online eval against the 6 held-out scenarios at steps 0, 10, 20, 30, 40
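For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage used in the standard formulation: each rollout's reward is normalised against the other rollouts for the same prompt. Whether Prime's hosted trainer uses exactly this normalisation (including the `eps` term) is an assumption; the function name is illustrative.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalise rewards within one GRPO group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


batch_size, rollouts_per_example = 16, 8
groups_per_step = batch_size // rollouts_per_example  # 2 prompts per step

# One group of 8 rollouts for the same task, with binary secure-pass rewards.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
all_fail = group_advantages([0.0] * 8)  # every advantage is 0: no learning signal
```

With a binary reward, a group only contributes gradient when its rollouts disagree, which is why a low starting pass rate makes each of the 40 steps less informative.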
Held-out pass@1 climbed 0.061 → 0.115 (+87% relative) on scenarios the model never saw during training (curve: eval_training_curve.png).
The LoRA adapter from step 39 is checked into lora_adapter/. Final inference-time eval on the trained adapter was blocked because poolside/Laguna-XS.2 is currently gated for LoRA deployment on Prime Intellect's inference infra (Error: Base model is not currently available for LoRA deployment). The training-time eval curve above measures the same held-out 24 tasks at every checkpoint, scored by the same env code — apples-to-apples, just smaller sample size than the full 100-task baseline.
Unfortunately, I didn't have time to properly benchmark the post-trained model: LoRA deployments weren't available, and self-hosting the model on Prime instances took too long.
How to reproduce
Install the env locally:
prime env install aidarbek/baxbench
Run the baseline:
prime eval run aidarbek/baxbench \
-m openai/gpt-5.5 \
-n 100 -r 1 -c 16 \
-a '{"split_by": "none"}'
Train Laguna-XS.2 with GRPO (requires Laguna training access + Prime Hosted Training):
prime train configs/rl/laguna-baxbench.toml -e PRIME_API_KEY -y
where configs/rl/laguna-baxbench.toml matches the values described above (40 steps, batch_size=16, rollouts_per_example=8, split_by="scenario" with test_size=0.2, split_seed=42).
To serve this LoRA adapter once Prime enables Laguna LoRA deployment:
prime deployments create <adapter_id> -y
prime eval run aidarbek/baxbench \
-m <deployed_model_id> \
-n 100 -r 1 -c 16 \
-a '{"split_by": "scenario", "test_size": 0.2, "split_seed": 42}'
Limitations and honest caveats
- 24-task held-out set has low resolution. Each percentage point ≈ 0.24 tasks (one task ≈ 4.2 points). The 0.061 → 0.115 trend is real but individual data points are noisy.
- Truncation rate dropped 79% → 58% during training. Some of the improvement may come from "produce a shorter, finishable answer" rather than "write better code." That's still a real and useful capability.
- No standalone post-RL eval. Until Prime enables LoRA serving for Laguna, the training-time eval is the only signal. Once unblocked, a full 100-task standalone prime eval run is the next number to publish.
- Python only. The wrapper supports Go / JavaScript / Ruby / PHP / Rust tasks via per-language Docker images, but only Python is currently dep-tested end-to-end. Default languages=["Python"] reflects this.
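To put a rough number on the low-resolution caveat, the sketch below computes a binomial standard error for the held-out scores. It assumes one independent pass/fail trial per task, which ignores multiple rollouts per task and the correlation between tasks that share a scenario, so treat it as a rough guide only.

```python
import math


def binomial_se(p, n):
    """Standard error of a pass rate p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


n_tasks = 24
se_start = binomial_se(0.061, n_tasks)
se_end = binomial_se(0.115, n_tasks)
print(f"step 0:  0.061 +/- {se_start:.3f}")
print(f"step 40: 0.115 +/- {se_end:.3f}")
```

Under these assumptions the 0.054 gain is of the same order as one standard error, which is why the full 100-task standalone eval matters.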
Citation
If you use this work, please also cite the original BaxBench paper:
@article{vero2025baxbenchllmsgeneratecorrect,
title = {BaxBench: Can LLMs Generate Correct and Secure Backends?},
author = {Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev},
year = {2025},
eprint = {2502.11844},
archivePrefix = {arXiv},
}