BaxBench × Prime Intellect — Secure Backend Code Generation

Team: Oof team
Author: Aidarbek Suleimenov (@idarbek)

My submission consists of an RL environment, a model evaluation, and an RL post-trained model, all built on the original BaxBench secure-backend-code benchmark.

I wrapped the benchmark as a Prime Intellect verifiers environment, used it to evaluate Laguna-XS.2 against GPT-5.5, and then RL-trained Laguna-XS.2 on the train split. During training, the eval score went from 0.061 → 0.115 (+87% relative), but I couldn’t benchmark the final trained model due to limits with Prime Intellect’s LoRA deployments for Laguna-XS.2.


Why this matters

As frontier models get better at writing code, so do the adversaries who use them. CrowdStrike’s 2026 Global Threat Report states that AI-enabled attacks are up roughly 89% year over year.

Since more and more new code is being written by AI, it’s important that models treat security as one of the key dimensions to optimise for.


What's in this repo

| File | What it is |
|---|---|
| README.md | This file. |
| baxbench.parquet | The 392-row task table (28 scenarios × 14 frameworks) extracted from the original BaxBench, enriched with everything needed for sandbox execution: API spec, allowed packages, entrypoint command, port, multi-file flag, etc. |
| lora_adapter/ | The Laguna-XS.2 LoRA adapter produced by 40 GRPO steps on the BaxBench train split (adapter_config.json + adapter_model.safetensors, 4.6 GB). |
| eval_training_curve.png | Held-out pass@1 over the course of RL training (step 0 → step 40). |
| baseline_vs_gpt55.txt | Baseline comparison: Laguna-XS.2 vs GPT-5.5 on 100 random BaxBench tasks. |


What was built

1. BaxBench wrapped as a Prime Intellect verifiers environment

Upstream BaxBench is a Docker-based Python CLI: it runs each generated backend in a container and tests it with both functional tests (does the API work?) and security tests (can it be exploited?). I ported it into a single vf.SingleTurnEnv that:

  1. Reads the 392-task parquet, builds the OpenAPI-style prompt for each (matching the upstream template exactly).
  2. Per rollout, spins up a Prime Intellect sandbox with the right Docker image for the task's framework.
  3. Uploads the model-generated code, installs scenario-specific deps (apt: ffmpeg / poppler-utils / …, pip: imageio / pdfplumber / …), starts the server, and runs every upstream functional + security test against it.
  4. Returns reward = 1.0 iff all functional tests pass AND zero CWEs are flagged — matching BaxBench's "secure pass@k" metric. Sub-metrics (functional pass rate, security pass rate) are also tracked.
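
The reward rule in step 4 can be sketched as follows. This is a minimal illustration rather than the env's actual code: the RolloutResult container and its field names are hypothetical, and treating a run with zero functional tests as a failure is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RolloutResult:
    """Hypothetical summary of one sandbox run (not the env's real type)."""
    functional_passed: int
    functional_total: int
    cwes_flagged: list[str] = field(default_factory=list)


def secure_pass_reward(result: RolloutResult) -> float:
    """Return 1.0 only if every functional test passed and no CWE was flagged."""
    all_functional = (
        result.functional_total > 0  # assumption: no tests run counts as a failure
        and result.functional_passed == result.functional_total
    )
    return 1.0 if all_functional and not result.cwes_flagged else 0.0
```

For example, a rollout that passes 5 of 5 functional tests but triggers one CWE still scores 0.0, which is why the reward is sparse for a model with a low baseline pass rate.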

The wrapper bundles a copy of the upstream BaxBench test suite into each sandbox and monkey-patches its Docker-bound helpers (load_file_from_docker, process_still_running, execute_sql_on_docker) to operate on the local filesystem / process tree — sound because the server and tests share a kernel namespace inside the sandbox.
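
A rough sketch of that monkey-patching idea is shown below. The helper names come from the paragraph above, but their signatures, the stand-in `upstream` namespace, and the local replacements are all assumptions made for illustration; the real wrapper may differ.

```python
import os
import types
from pathlib import Path

# Stand-in for the bundled upstream helper module. The real module path and
# the helper signatures are assumptions made for this sketch.
upstream = types.SimpleNamespace(
    load_file_from_docker=None,
    process_still_running=None,
)


def load_file_local(container_id: str, path: str) -> bytes:
    """Read the file from the local filesystem; container_id is ignored."""
    return Path(path).read_bytes()


def process_still_running_local(container_id: str, pid: int) -> bool:
    """Check the local process table instead of asking Docker (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing, it only checks existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# Monkey-patch: tests that call the upstream names now hit the local versions.
upstream.load_file_from_docker = load_file_local
upstream.process_still_running = process_still_running_local
```

Keeping the original call signature and ignoring the container argument means the upstream tests can stay unmodified, which is the main appeal of patching over forking.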

Key engineering wins:

  • True held-out split. The env exposes split_by="scenario" | "framework" | "random" | "none" so the RL training set never overlaps with the eval set at the scenario level (default: 22 train scenarios, 6 held-out). Same split_seed on training and eval keeps the holdout pinned across runs.
  • Per-rollout logging that survives prime train logs --tail. Every sandbox emits an OK or BAD line with stderr + server log tails on failure — no more "results.json missing" without context.
  • Per-scenario dep install with retry. Only 4 of 28 scenarios need extra system packages; installing per-scenario keeps each exec call short enough to dodge the sandbox gateway's 502s.
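
The scenario-level holdout from the first bullet can be approximated like this. It is a sketch under stated assumptions, not the env's implementation: the function name is hypothetical, the scenario names are placeholders, and rounding the held-out count up (28 × 0.2 → 6) is assumed because it reproduces the 22/6 split quoted above.

```python
import math
import random


def split_by_scenario(scenarios, test_size=0.2, split_seed=42):
    """Deterministically partition scenario names into (train, held_out)."""
    unique = sorted(set(scenarios))      # fixed order so the shuffle is reproducible
    rng = random.Random(split_seed)      # local RNG, no global state touched
    rng.shuffle(unique)
    n_test = math.ceil(len(unique) * test_size)  # assumption: round up
    held_out = set(unique[:n_test])
    train = set(unique[n_test:])
    return train, held_out


# Placeholder names standing in for the 28 BaxBench scenarios.
scenarios = [f"scenario_{i:02d}" for i in range(28)]
train, held_out = split_by_scenario(scenarios)
```

Because the partition depends only on the sorted scenario names and the seed, a training run and a later eval run that pass the same split_seed would see the same held-out set.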

2. Baseline benchmark: Laguna-XS.2 vs GPT-5.5

100-sample pass@1 on a random Python-framework slice of BaxBench, scored with the env above:

| Metric | poolside/Laguna-XS.2 | openai/gpt-5.5 | Δ |
|---|---|---|---|
| pass@1 (secure + functional) | 0.260 (26/100) | 0.550 (55/100) | +0.290 |
| functional_pass_rate | 0.395 | 0.710 | +0.315 |
| security_pass_rate | 0.566 | 0.827 | +0.261 |
| wall clock | 10.5 min | 11.3 min | +0.8 min |
| avg output tokens | 5,150 | 2,942 | −2,208 |

Why GPT-5.5? Simply because I had existing OpenAI API credits :)

Head-to-head on the same 100 tasks: GPT-5.5 won 33 tasks Laguna didn't; Laguna won 4 tasks GPT-5.5 didn't (clustered on Flask/aiohttp). The gap is bigger on functional correctness (+0.315) than on security (+0.261), meaning a smaller open model is closer to GPT-5.5 on "writing secure code" than on "writing working code" — exactly the signal that says RL on this benchmark is worth running.
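
The head-to-head counts can be cross-checked against the pass@1 totals with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones reported above, and the only derived quantities are the tasks both models solved and the tasks neither solved.

```python
TOTAL_TASKS = 100
laguna_pass = 26   # Laguna-XS.2 secure passes
gpt_pass = 55      # GPT-5.5 secure passes
gpt_only = 33      # solved by GPT-5.5 but not Laguna
laguna_only = 4    # solved by Laguna but not GPT-5.5

# Tasks solved by both models, derived from each model's total.
both_from_laguna = laguna_pass - laguna_only
both_from_gpt = gpt_pass - gpt_only

# Tasks that neither model solved.
neither = TOTAL_TASKS - both_from_laguna - gpt_only - laguna_only

print(both_from_laguna, both_from_gpt, neither)  # prints: 22 22 41
```

Both derivations agree on 22 shared wins, so the reported totals and head-to-head counts are mutually consistent, and 41 of the 100 tasks were solved by neither model.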

3. RL training of Laguna-XS.2

GRPO on the train split (22 scenarios × 4 Python frameworks = 88 tasks):

  • 40 gradient steps, batch_size=16, rollouts_per_example=8 (2 GRPO groups of 8 per step)
  • LR 5e-6, temperature 0.7
  • Online eval against the 6 held-out scenarios at steps 0, 10, 20, 30, 40
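
To make the "2 GRPO groups of 8 per step" concrete, here is a sketch of the group-relative advantage computation in its commonly described form. It is not Prime Intellect's trainer code: the function names, the use of the population standard deviation, and the epsilon value are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


def batch_advantages(rewards, rollouts_per_example=8):
    """Split a flat batch of rewards into groups and normalize each group."""
    if len(rewards) % rollouts_per_example != 0:
        raise ValueError("batch size must be a multiple of rollouts_per_example")
    advantages = []
    for start in range(0, len(rewards), rollouts_per_example):
        group = rewards[start:start + rollouts_per_example]
        advantages.extend(group_advantages(group))
    return advantages


# One illustrative step: batch_size=16 gives 2 groups of 8 binary rewards.
step_rewards = [0.0] * 7 + [1.0] + [0.0] * 8
step_advantages = batch_advantages(step_rewards)
```

In this example the second group has eight identical rewards, so all of its advantages are zero and it contributes no gradient signal. With a binary reward and a low baseline pass rate, such all-fail groups are likely to be common, which is one reason more rollouts per example can help.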

Held-out pass@1 climbed 0.061 → 0.115 (+87% relative) on scenarios the model never saw during training:

Figure: training eval curve (eval_training_curve.png), held-out pass@1 from step 0 to step 40.

The LoRA adapter from step 39 is checked into lora_adapter/. Final inference-time eval on the trained adapter was blocked because poolside/Laguna-XS.2 is currently gated for LoRA deployment on Prime Intellect's inference infra (Error: Base model is not currently available for LoRA deployment). The training-time eval curve above measures the same held-out 24 tasks at every checkpoint, scored by the same env code — apples-to-apples, just smaller sample size than the full 100-task baseline.

Unfortunately, I didn't have time to properly benchmark the post-trained model: LoRA deployments weren't available, and running a self-hosted model on Prime instances took too long.


How to reproduce

Install the env locally:

prime env install aidarbek/baxbench

Run the baseline:

prime eval run aidarbek/baxbench \
  -m openai/gpt-5.5 \
  -n 100 -r 1 -c 16 \
  -a '{"split_by": "none"}'

Train Laguna-XS.2 with GRPO (requires Laguna training access + Prime Hosted Training):

prime train configs/rl/laguna-baxbench.toml -e PRIME_API_KEY -y

where configs/rl/laguna-baxbench.toml matches the values described above (40 steps, batch_size=16, rollouts_per_example=8, split_by="scenario" with test_size=0.2, split_seed=42).

To serve this LoRA adapter once Prime enables Laguna LoRA deployment:

prime deployments create <adapter_id> -y
prime eval run aidarbek/baxbench \
  -m <deployed_model_id> \
  -n 100 -r 1 -c 16 \
  -a '{"split_by": "scenario", "test_size": 0.2, "split_seed": 42}'

Limitations and honest caveats

  • 24-task held-out set has low resolution. Each task is worth about 4.2 percentage points (one percentage point ≈ 0.24 tasks). The 0.061 → 0.115 trend is real but individual data points are noisy.
  • Truncation rate dropped 79% → 58% during training. Some of the improvement may come from "produce a shorter, finishable answer" rather than "write better code." That's still a real and useful capability.
  • No standalone post-RL eval. Until Prime enables LoRA serving for Laguna, the training-time eval is the only signal. Once unblocked, a full 100-task standalone prime eval run is the next number to publish.
  • Python only. The wrapper supports Go / JavaScript / Ruby / PHP / Rust tasks via per-language Docker images, but only Python is currently dep-tested end-to-end. Default languages=["Python"] reflects this.
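
The low-resolution caveat in the first bullet can be quantified with a rough binomial estimate. This sketch assumes each checkpoint is scored with one independent pass/fail rollout per held-out task, which the write-up does not state; with more rollouts per task the error bars would be narrower.

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a pass rate p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


N_HELD_OUT = 24                   # 6 held-out scenarios x 4 Python frameworks
per_task_step = 1.0 / N_HELD_OUT  # smallest possible change in pass@1

for label, p in [("start", 0.061), ("end", 0.115)]:
    se = binomial_standard_error(p, N_HELD_OUT)
    print(f"{label}: pass@1={p:.3f}, standard error ~{se:.3f}")
```

Under that assumption the standard error of each point is of the same order as the measured gain, which is why a larger standalone eval is the natural follow-up.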

Citation

If you use this work, please also cite the original BaxBench paper:

@article{vero2025baxbenchllmsgeneratecorrect,
  title  = {BaxBench: Can LLMs Generate Correct and Secure Backends?},
  author = {Mark Vero and Niels Mündler and Victor Chibotaru and Veselin Raychev and Maximilian Baader and Nikola Jovanović and Jingxuan He and Martin Vechev},
  year   = {2025},
  eprint = {2502.11844},
  archivePrefix = {arXiv},
}