# Teaching LLMs to Debug Like Engineers, Not Gamblers

How we built a reinforcement learning environment that forces language models to reason before they act, and why that distinction matters more than most people think.

---

## The Problem With How LLMs Debug Code

Ask any LLM to fix a bug, and it'll give you an answer. Often it's even right. But watch what happens when the bug is subtle, or when the context spans multiple files, or when the model has seen something superficially similar but structurally different in training. The output devolves into what we started calling *"pattern-matching theater"*: the model produces plausible-looking fixes with high confidence, none of which address the actual root cause.

This isn't an intelligence failure. It's an incentive failure. Language models trained on static datasets learn to complete text, not to solve problems. They have no mechanism that rewards them for slowing down, forming a hypothesis, and testing it. So they don't.

We built *AgentDebuggerEnv* to change that incentive structure from the ground up: a reinforcement learning environment where the reward signal is explicitly designed to punish blind guessing and reward structured, hypothesis-driven debugging. The result: a model that learns to behave less like autocomplete and more like a junior engineer who actually reads the stack trace.

---

## What We Built

AgentDebuggerEnv is a training environment for fine-tuning code LLMs using *Group Relative Policy Optimization (GRPO)*. At its core, it enforces a strict cognitive loop on every agent action:

OBSERVATION → HYPOTHESIS → ACTION

No hypothesis, no reward. This isn't a soft nudge; it's a hard architectural constraint baked into the grading subsystem. The model learns that skipping the reasoning step is literally unprofitable.

The environment wraps this loop around a tiered curriculum of progressively harder Python bugs, evaluated inside a secure execution sandbox, with a hybrid grader that combines deterministic execution testing with LLM-based semantic evaluation of the reasoning quality.

Training is orchestrated via HuggingFace TRL's GRPO pipeline, with live metrics streaming to a Weights & Biases dashboard and a Gradio UI for real-time monitoring.

---

## Architecture: Five Systems, One Training Loop

Understanding the system requires understanding how its five components interact during a single training step.

### 1. The OpenEnv Core

The environment is built on *OpenEnv*, which manages state transitions across agent turns. Each episode begins with the agent receiving a buggy Python snippet. The state object tracks (see the sketch after this list):

- The original buggy code
- The agent's current hypothesis
- The action taken (proposed fix)
- Execution output from the sandbox
- Cumulative reward signal

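
As a rough illustration, the state record might look like the sketch below. This is our illustration, not AgentDebuggerEnv's actual schema; the field names are guesses.

```python
# Hypothetical sketch of the per-episode state record; field names are
# illustrative, not the project's actual schema.
from dataclasses import dataclass

@dataclass
class EpisodeState:
    buggy_code: str                 # the original buggy snippet
    hypothesis: str = ""            # the agent's current causal claim
    action: str = ""                # the proposed fix
    execution_output: str = ""      # captured stdout/stderr from the sandbox
    cumulative_reward: float = 0.0  # running reward across the episode

    def record_step(self, hypothesis: str, action: str,
                    output: str, reward: float) -> None:
        """Update state after one OBSERVATION -> HYPOTHESIS -> ACTION turn."""
        self.hypothesis = hypothesis
        self.action = action
        self.execution_output = output
        self.cumulative_reward += reward
```
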
The core enforces episode structure: an agent cannot submit an ACTION without a preceding HYPOTHESIS in the same response. This is validated structurally before the grader ever runs.

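
A minimal sketch of that structural gate, assuming responses are plain text with labeled sections; the regexes and the zero-reward convention are our illustration, not the exact implementation.

```python
# Sketch of the structural gate: reject any response where ACTION is not
# preceded by HYPOTHESIS, before any grading happens.
import re

SECTION = re.compile(r"^(OBSERVATION|HYPOTHESIS|ACTION):", re.MULTILINE)

def validate_structure(response: str) -> bool:
    """True only if a HYPOTHESIS section precedes the ACTION section."""
    labels = SECTION.findall(response)
    if "HYPOTHESIS" not in labels or "ACTION" not in labels:
        return False
    return labels.index("HYPOTHESIS") < labels.index("ACTION")

def gated_reward(response: str, grade) -> float:
    """No hypothesis, no reward: malformed output never reaches the grader."""
    if not validate_structure(response):
        return 0.0
    return grade(response)
```
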
### 2. The Grader Subsystem

This is where most of the engineering complexity lives. We built a two-layer grader:

*Hard Grader*: deterministic, execution-based.

- Runs the buggy code and captures the failure mode
- Executes the agent's proposed fix in an isolated sandbox
- Computes a delta: did the fix actually resolve the regression?
- Uses AST matching for structural equivalence checks (sketched below)

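
The AST check can be surprisingly small. Here is a sketch of the structural-equivalence idea using `ast.dump` as a simple (and strict) normalization; the real grader's normalization rules may differ.

```python
# Sketch: two snippets are structurally equal if they parse to the same
# abstract syntax tree, so formatting-only edits don't count as a change.
import ast

def structurally_equal(code_a: str, code_b: str) -> bool:
    """True if both snippets parse to identical ASTs."""
    try:
        return ast.dump(ast.parse(code_a)) == ast.dump(ast.parse(code_b))
    except SyntaxError:
        return False
```
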
The hard grader deliberately does not reward fixes that pass the tests while sidestepping the bug (e.g., returning a hardcoded value). Early runs exposed this attack vector immediately: the model found it within the first 50 training steps.

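
Here is the regression-delta idea in miniature, under our assumptions: each task ships several hidden input/output cases, so a hardcoded return can't pass them all. The `execute` callable is a stand-in for whatever sandboxed runner you have.

```python
# Sketch of the regression-delta check: reward only the share of
# previously-failing cases that the proposed fix actually repairs.
def regression_delta(buggy_code: str, fixed_code: str,
                     cases: list[tuple], execute) -> float:
    """`execute(code, args)` returns (ok, value); it is a hypothetical runner."""
    def passes(code, args, expected) -> bool:
        ok, value = execute(code, args)
        return ok and value == expected

    previously_failing = [(a, e) for a, e in cases
                          if not passes(buggy_code, a, e)]
    if not previously_failing:
        return 0.0  # nothing was broken, so "fixing" it earns nothing
    repaired = sum(passes(fixed_code, a, e) for a, e in previously_failing)
    return repaired / len(previously_failing)
```
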
*Soft Grader*: semantic, LLM-based (Llama-3.1-70B).

- Evaluates the quality of the hypothesis, independent of whether the fix worked
- Scores reasoning clarity, specificity, and causal accuracy
- Provides a partial reward signal even for a correct diagnosis with an incorrect fix

This separation matters architecturally: it means the model gets a training signal for thinking correctly, not just for fixing correctly. A model that correctly identifies a null pointer dereference but writes a slightly wrong fix still learns something useful.

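
One way to realize that separation is a simple weighted blend. A sketch follows; the 0.7/0.3 weights are placeholders, not our tuned values.

```python
# Sketch: blend deterministic execution reward with semantic hypothesis quality.
def combined_reward(exec_reward: float, hypothesis_score: float,
                    w_hard: float = 0.7, w_soft: float = 0.3) -> float:
    """A correct diagnosis with a broken fix still earns w_soft * score,
    so the policy gets gradient signal for thinking correctly."""
    return w_hard * exec_reward + w_soft * hypothesis_score
```
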
### 3. The Execution Sandbox

Evaluating arbitrary LLM-generated code is a Remote Code Execution (RCE) problem in disguise. Our sandbox:

- Replaces all `exec()` calls with a controlled execution harness
- Enforces CPU time limits and memory caps per execution
- Runs in a container-isolated environment
- Returns deterministic pass/fail signals with captured stdout/stderr

This took more engineering time than expected. The naive approach of just calling `exec()` on the fix in a subprocess fails immediately in a hackathon-grade RL loop, because a single adversarial output (`import os; os.system("rm -rf /")`) can destroy your training run.

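
For flavor, here is a minimal harness using only the standard library: a subprocess with CPU/memory rlimits and a wall-clock timeout. This is a sketch, not our hardened sandbox; container isolation is assumed to wrap it, and `preexec_fn`/`resource` are POSIX-only.

```python
# Sketch: run untrusted code in an isolated interpreter with resource limits.
import resource
import subprocess
import sys

def _limit_resources():
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))                 # 2s of CPU
    resource.setrlimit(resource.RLIMIT_AS, (256 << 20, 256 << 20))  # 256 MB

def run_sandboxed(code: str, timeout: float = 5.0) -> tuple[bool, str, str]:
    """Execute `code` in a fresh interpreter; return (ok, stdout, stderr)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True, text=True,
            timeout=timeout, preexec_fn=_limit_resources,
        )
        return proc.returncode == 0, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "", "timeout"
```
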
### 4. The GRPO Training Pipeline

We chose GRPO over PPO for a practical reason: GRPO eliminates the need for a separate value network, which halves the VRAM requirement on constrained hardware. For a hackathon running on Colab T4s, this isn't a nice-to-have; it's the difference between training and not training.

The pipeline uses HuggingFace TRL with LoRA adapters on *Qwen2.5-Coder-7B-Instruct*. Hardware detection at runtime automatically configures:

| Hardware  | batch_size | grad_accum | dtype    |
|-----------|------------|------------|----------|
| A100/H100 | 8          | 2          | bfloat16 |
| T4        | 2          | 8          | float16  |

This isn't just convenience; it's the reason our notebook runs reproducibly for judges on any GPU tier without manual config edits.

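
A sketch of the detection logic behind that table, assuming CUDA compute capability >= 8.0 implies bfloat16 support (true for A100/H100; a T4 is 7.5). The config keys mirror the table but are illustrative.

```python
# Sketch: pick batch size, gradient accumulation, and dtype from the visible GPU.
import torch

def detect_training_config() -> dict:
    if not torch.cuda.is_available():
        raise RuntimeError("GRPO training requires a GPU")
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:  # Ampere or newer: A100, H100
        return {"batch_size": 8, "grad_accum": 2, "dtype": torch.bfloat16}
    return {"batch_size": 2, "grad_accum": 8, "dtype": torch.float16}
```
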
### 5. The Live Monitor

A Gradio dashboard streams stdout and W&B metrics directly from the active training container. This served two purposes: debugging during development, and providing judges with live evidence of the training run rather than static screenshots.

---

## The Curriculum: Why Flat Bug Distributions Fail

One of our most important design decisions came from a failure.

In early runs, we trained on a flat distribution of bugs: syntax errors, logic flaws, and type errors all mixed together. The policy collapsed within 100 steps. The model found a local optimum: memorize the three most common bug patterns, ignore everything else.

We implemented a *3-tier curriculum*:

- *Tier 1:* Structural formatting bugs, missing brackets, simple syntax errors
- *Tier 2:* Localization bugs (correct syntax, wrong variable or index)
- *Tier 3:* Logic bugs (semantically plausible but algorithmically incorrect code)

The agent only encounters Tier 2 bugs after Tier 1 format compliance stabilizes (monitored via the W&B dashboard). Tier 3 unlocks after Tier 2 localization accuracy crosses a threshold. The training curve shows a textbook drop-and-recover at step 150 (the Tier 2 transition), followed by steady policy improvement.

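
The escalation logic itself can be small. A sketch assuming rolling-window metrics; the thresholds and window size are placeholders, not our tuned values.

```python
# Sketch: escalate bug tiers once the policy is stable on the current one.
from collections import deque

class CurriculumManager:
    def __init__(self, window: int = 50):
        self.tier = 1
        self.format_ok = deque(maxlen=window)     # Tier 1 -> 2 signal
        self.localized_ok = deque(maxlen=window)  # Tier 2 -> 3 signal

    @staticmethod
    def _rate(buf: deque) -> float:
        # Only trust the rate once the window is full.
        return sum(buf) / len(buf) if len(buf) == buf.maxlen else 0.0

    def update(self, format_compliant: bool, bug_localized: bool) -> int:
        """Record one episode's outcomes; return the (possibly new) tier."""
        self.format_ok.append(format_compliant)
        self.localized_ok.append(bug_localized)
        if self.tier == 1 and self._rate(self.format_ok) >= 0.95:
            self.tier = 2
        elif self.tier == 2 and self._rate(self.localized_ok) >= 0.80:
            self.tier = 3
        return self.tier
```
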
This is *curriculum learning applied to code reasoning*. The idea isn't new in RL (it traces back to Bengio et al., 2009), but applying it to LLM fine-tuning in this structured way is still relatively underexplored. The NeurIPS 2025 work on hypothesis-driven debugging ([arXiv:2408.10215](https://arxiv.org/abs/2408.10215)) and Amazon's findings on LLM reasoning ([arXiv:2601.19100](https://arxiv.org/abs/2601.19100)) both informed our reward-shaping architecture.

---

## Results: What 250 Steps Actually Looks Like

The training curves tell a clear story:

- *Format Compliance* hit 1.0 within 50 steps and stayed there: the model learned the OBSERVATION/HYPOTHESIS/ACTION structure immediately and never broke it.
- *Total Reward* climbed from a baseline of ~0.4 to peaks of ~1.0 by step 250, representing roughly a 2.5x improvement in overall policy quality.
- The *Tier 2 transition at step 150* is visible as a transient dip: the policy briefly destabilizes when harder bugs appear, then recovers. This is the signature of a healthy curriculum; an unhealthy one would see permanent regression.

A 100% validation solve rate on tiered data structure bugs confirms the model isn't just gaming the reward function; it's developing a generalizable debugging policy within its training distribution.

---

## Engineering Trade-offs Worth Noting

*GRPO vs PPO:* GRPO's advantage is memory efficiency. Its disadvantage is that it can exhibit higher variance in early training. For a 250-step run, this trade-off is favorable. For longer runs at scale, PPO's stability properties may become more attractive; a direct comparison on this task would be a valuable follow-up experiment.

*LLM-as-a-Judge for Hypothesis Quality:* Using Llama-3.1-70B to evaluate hypothesis quality is semantically powerful but expensive and non-deterministic. We deliberately isolated this to hypothesis scoring only (not execution correctness), which limits the blast radius of its non-determinism. If cost is a constraint, this could be replaced with a smaller fine-tuned scorer, though you'd sacrifice some semantic nuance.

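
For concreteness, the judge call might look like the sketch below, assuming an OpenAI-compatible endpoint serving Llama-3.1-70B; the model name, prompt, and score parsing are illustrative, not our exact implementation.

```python
# Sketch of the soft grader's judge call against an assumed
# OpenAI-compatible endpoint; prompt and parsing are illustrative.
from openai import OpenAI

JUDGE_PROMPT = """Rate the debugging hypothesis below from 0.0 to 1.0 for \
clarity, specificity, and causal accuracy. Reply with only the number.

Buggy code:
{code}

Hypothesis:
{hypothesis}"""

def score_hypothesis(client: OpenAI, code: str, hypothesis: str) -> float:
    """Semantic reward in [0.0, 1.0]; unparseable judge output earns nothing."""
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",  # assumed deployment name
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(code=code,
                                                  hypothesis=hypothesis)}],
        temperature=0.0,  # reduces, but cannot eliminate, nondeterminism
    )
    try:
        return min(max(float(resp.choices[0].message.content.strip()), 0.0), 1.0)
    except (TypeError, ValueError):
        return 0.0
```
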
*Single-file scope:* The current environment operates on isolated Python snippets, not multi-file repositories. Real production bugs rarely present this cleanly. The environment is a training tool, not a deployment tool; this scope is intentional for now, but it's the primary limitation on real-world transfer.

---

## How the System Flows, End to End

1. Episode starts → agent receives a buggy Python snippet

2. Agent produces a structured response:
   - OBSERVATION: [what the agent notices about the code]
   - HYPOTHESIS: [specific causal claim about the bug]
   - ACTION: [proposed fix]

3. Environment validates structure → no hypothesis = zero reward

4. Hard Grader runs:
   - Execute buggy code, capture failure signature
   - Execute proposed fix in sandboxed harness
   - Compute regression delta
   - Return binary execution reward

5. Soft Grader runs:
   - Send hypothesis to Llama-3.1-70B evaluator
   - Score reasoning quality (0.0 to 1.0)
   - Return semantic reward

6. Combined reward signal fed to GRPO update

7. W&B logs step metrics → Gradio dashboard updates live

8. Curriculum manager checks tier thresholds → escalates if ready

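
Step 6 is where the environment meets TRL. Below is a sketch of wiring a custom reward into the GRPO trainer, assuming a recent `trl` release where `GRPOTrainer` accepts `reward_funcs`; `bug_dataset` and the per-task grading wrappers are hypothetical stand-ins, and `validate_structure`/`combined_reward` are the helpers sketched earlier.

```python
# Sketch of the GRPO wiring, assuming a recent `trl` with reward_funcs support.
from trl import GRPOConfig, GRPOTrainer

def debug_reward(completions, **kwargs):
    """Score each sampled completion: structural gate, then hard + soft grades."""
    rewards = []
    for text in completions:
        if not validate_structure(text):   # sketched earlier
            rewards.append(0.0)            # no hypothesis, no reward
            continue
        hard = regression_delta_for(text)  # hypothetical per-task wrapper
        soft = hypothesis_score_for(text)  # hypothetical judge wrapper
        rewards.append(combined_reward(hard, soft))
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    reward_funcs=debug_reward,
    args=GRPOConfig(output_dir="agentdebugger-grpo", num_generations=4),
    train_dataset=bug_dataset,             # prompts built from buggy snippets
)
trainer.train()
```
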
---

## What We'd Build Next

*Multi-file repository debugging* is the obvious next frontier. The current single-snippet scope is a necessary simplification for training stability, but real debugging involves tracing across call stacks, imported modules, and shared state. An LSP integration that gives the agent a live view of the codebase, with go-to-definition and symbol resolution, would transform this from a toy into a tool.

*Adversarial bug generation* is the more interesting research direction. Instead of a static tiered dataset, an adversarial LLM agent continuously mutates bugs to exploit the current policy's weaknesses. This creates a self-sustaining curriculum that doesn't plateau: the harder the primary agent gets, the harder the adversary makes the bugs. It's essentially a GAN applied to code reasoning.

*PPO vs GRPO benchmarking* on this specific task would produce publishable findings. The community has strong priors about when each algorithm excels, but empirical data on structured reasoning tasks with custom multi-objective rewards is sparse.

---

## Summary

AgentDebuggerEnv demonstrates that you can use reinforcement learning to teach an LLM to reason about bugs rather than pattern-match fixes. The key architectural choices (mandatory hypothesis formation, regression delta verification, curriculum learning, and hybrid deterministic/semantic grading) work together to close the gap between "outputs that look like debugging" and "outputs that actually debug."

The results are real: a 2.5x reward improvement in 250 steps, zero reward hacking after grader hardening, and a training pipeline that runs reproducibly on hardware ranging from Colab T4s to A100 clusters.

More importantly, the architecture is extensible. Every component (the grader, the curriculum, the reward function, the sandbox) is modular and independently swappable. If you want to apply this framework to a different debugging domain, a different model family, or a different reward philosophy, the foundation is already there.

We built this in a hackathon sprint. We think it's worth building further.

---

Built by Team Endurance (Shashaank Jain & Pranav Pulipati) for the Scaler × Meta × PyTorch Hackathon.

[Live Space](https://huggingface.co/spaces/shashaank0707/AgentDebugger-training-v3) · [W&B Run](https://wandb.ai/shashaankjain07-keshav-memorial-college-of-law/AgentDebuggerEnv/runs/vylbqd5m?nw=nwusershashaankjain07) · [GitHub: @PulipatiPranav](https://github.com/PulipatiPranav)