CodeArena: Teaching LLMs to Debug Code Through Reinforcement Learning
An OpenEnv-compatible RL environment for iterative code repair with adaptive difficulty, hybrid grading, and self-improving agent memory.
The Problem: Why We Built CodeArena
Every major AI coding assistant (GitHub Copilot, Cursor, Devin) is benchmarked on code generation. Can it write a function? Can it complete a snippet?
But here's the gap nobody is talking about: what happens when the code breaks?
In production, code breaks constantly. A real developer doesn't just generate code; they spend the majority of their time reading error logs, reasoning about failure, iterating on fixes, and recovering from mistakes. This iterative debugging loop is the core skill that separates a junior developer from a senior one.
Yet there is no standardized RL environment to train or evaluate an LLM on this capability. HumanEval measures one-shot generation. MBPP measures function completion. Neither measures what happens across multiple repair attempts when the first fix doesn't work.
CodeArena is the first open-source, OpenEnv-compatible reinforcement learning environment built specifically for iterative code repair.
How CodeArena Works
The Loop
CodeArena simulates the real-world debugging workflow:
1. Agent receives buggy Python code + error log
2. Agent proposes a fix
3. Environment executes the fix in a sandboxed subprocess
4. Environment runs unit tests and scores the fix
5. Agent receives reward + updated error log
6. Repeat up to 5 steps
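The loop above maps directly onto two HTTP calls. Here is a minimal client sketch, assuming the server's default local port 7860 and the field names used later in this post; the `post` parameter is injectable purely so the loop can be exercised without a running server:

```python
def debug_episode(propose_fix, base_url="http://localhost:7860",
                  max_steps=5, post=None):
    """Drive one CodeArena repair episode over the OpenEnv HTTP API."""
    if post is None:                  # default transport; injectable for testing
        import httpx
        post = httpx.post
    # /reset returns the initial observation: buggy_code + error_log
    obs = post(f"{base_url}/reset", json={"task_id": "auto"}).json()
    total_reward = 0.0
    for _ in range(max_steps):
        fix = propose_fix(obs)                      # agent policy: obs -> fix
        res = post(f"{base_url}/step", json={"proposed_fix": fix}).json()
        total_reward += res.get("reward", 0.0)
        obs = res.get("observation", res)           # carry new error log forward
        if res.get("done"):
            break
    return total_reward
```

Any policy works as `propose_fix`, from a hand-written heuristic to a full LLM call.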
This is fundamentally different from one-shot code generation benchmarks. The agent must:
- Read and interpret error messages from previous attempts
- Track what it has already tried (repeated fixes are penalized)
- Decide whether to patch locally or rewrite entirely
- Optimize for efficiency, not just correctness
Architecture
```
Agent ──── POST /reset ───► CodeArena Server ───► returns buggy_code + error_log
  │                               │
  │                               ├── Task Loader (9 tasks across 5 categories)
  │                               ├── Sandboxed Executor (subprocess + timeout)
  │                               ├── Hybrid Grader (tests + LLM judge)
  │                               ├── Algorithm Detector (complexity analysis)
  │                               └── Agent Memory (self-improving store)
  │
  └──── POST /step ─────────► returns observation, reward, done, info
```
The server is a standard FastAPI application that implements the OpenEnv specification (/reset, /step, /state). The openenv.yaml manifest defines the observation space (buggy code, error log, test results, previous attempts) and the action space (proposed fix).
What Makes CodeArena Special (Environment Innovation)
1. Hybrid Grader: Tests + LLM-as-Judge
Most coding benchmarks use a single signal: did the tests pass? This creates a fundamental problem: agents learn to produce code that passes weak tests through reward-hacking (e.g., hardcoding expected outputs, or producing syntactically correct but semantically broken code).
CodeArena uses a Hybrid Grader with six weighted components:
| Component | Weight | What It Measures |
|---|---|---|
| `compile_score` | 15% | Code compiles without syntax errors |
| `test_pass_ratio` | 35% | Fraction of unit tests passed |
| `efficiency_score` | 30% | Execution time vs. optimal runtime |
| `llm_correctness` | 10% | LLM judge: is the fix logically correct? |
| `llm_security` | 5% | LLM judge: does the fix introduce vulnerabilities? |
| `llm_quality` | 5% | LLM judge: is the code readable and maintainable? |
Additionally, two penalties are applied:
- Step penalty (`-0.01 × step_count`): rewards faster fixes
- Novelty penalty (`-0.10`): penalizes submitting the same fix twice
The LLM judge is called via the OpenAI-compatible API (configurable to GPT-4o-mini, local Ollama, or HuggingFace Inference). When no API key is available, it falls back to neutral scores (0.5), ensuring the environment always runs.
Why this matters for training: The heavy 30% weight on efficiency means that an agent that passes all tests with an O(n²) brute-force solution gets a significantly lower reward than one that uses an O(n) algorithm. This forces the model to learn algorithmic reasoning, not just syntax repair.
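Putting the table and the penalties together, the final reward is a weighted sum minus the two deductions. A sketch (component names are taken from the table above; the 0.5 default mirrors the neutral fallback used when no LLM judge is available):

```python
# Weights from the hybrid grader table
WEIGHTS = {
    "compile_score": 0.15,
    "test_pass_ratio": 0.35,
    "efficiency_score": 0.30,
    "llm_correctness": 0.10,
    "llm_security": 0.05,
    "llm_quality": 0.05,
}

def hybrid_reward(components, step_count, is_repeat):
    """Weighted sum of grading components minus step and novelty penalties."""
    # Missing components default to 0.5, matching the neutral LLM-judge fallback
    score = sum(w * components.get(k, 0.5) for k, w in WEIGHTS.items())
    score -= 0.01 * step_count        # step penalty: rewards faster fixes
    if is_repeat:
        score -= 0.10                 # novelty penalty: same fix submitted twice
    return score
```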
2. Adaptive Curriculum (Theme #4: Self-Improvement)
CodeArena doesn't use a fixed task set. It features an Adaptive Curriculum that tracks the agent's rolling average reward over recent episodes and automatically adjusts difficulty:
| Condition | Transition |
|---|---|
| avg reward > 0.80 on Easy | → Medium |
| avg reward > 0.75 on Medium | → Hard |
| avg reward < 0.35 on Hard | → Medium (de-escalate) |
| avg reward < 0.35 on Medium | → Easy (de-escalate) |
This is activated by passing task_id: "auto" to the /reset endpoint.
Why this matters: The agent cannot plateau by memorizing solutions to easy tasks. As soon as it masters syntax errors, the environment pushes it to algorithmic logic bugs. If it struggles, it recovers on easier tasks before trying again. This creates a natural recursive skill amplification loop: the environment drives the agent's own capability growth.
3. Algorithm Detection + Adaptive Prompting
CodeArena includes a built-in Algorithm Detector (server/algorithm_detector.py) that:
- Classifies the problem type (max subarray, two-sum, binary search, sliding window, etc.) from code patterns
- Estimates time complexity by analyzing loop nesting depth (O(1) → O(n) → O(n²) → O(n³))
- Generates targeted optimization hints (e.g., "Use Kadane's Algorithm O(n): `curr = max(num, curr+num)`")
When the AI fixer generates a repair, the algorithm detector provides adaptive prompt suffixes based on the current reward level:
- Low reward (< 0.4): "Focus on correctness. Fix syntax errors first."
- Medium reward (0.4โ0.7): "Fix edge cases and logic bugs."
- High reward (> 0.7): "Optimize for performance. Use O(n) algorithms."
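The loop-nesting heuristic behind the complexity estimate can be sketched with Python's `ast` module. This is only the nesting-depth part; the real detector in `server/algorithm_detector.py` also does pattern-based problem classification:

```python
import ast

def estimate_complexity(source):
    """Rough big-O estimate from maximum for/while nesting depth."""
    tree = ast.parse(source)

    def depth(node):
        # Deepest loop nesting among children, plus one if this node is a loop
        child_max = max((depth(c) for c in ast.iter_child_nodes(node)), default=0)
        return child_max + (1 if isinstance(node, (ast.For, ast.While)) else 0)

    d = depth(tree)
    return {0: "O(1)", 1: "O(n)", 2: "O(n^2)", 3: "O(n^3)"}.get(d, f"O(n^{d})")
```

Note this is a proxy, not a proof: two sequential loops still count as O(n), but a loop hiding behind a function call is missed.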
4. Self-Improving Agent Memory
CodeArena includes a persistent Agent Memory system (server/memory.py) that stores the best solution found for each task. When the agent encounters the same task type again, it can retrieve its previous best solution as a starting point.
This creates a genuine self-improvement loop:
- Episode 1: Agent fixes syntax → reward 0.45
- Episode 5: Agent recalls its best previous fix, optimizes further → reward 0.72
- Episode 10: Agent has accumulated enough memory to skip basic fixes entirely → reward 0.88
The memory is persisted to agent_memory.json and survives server restarts.
5. Rich Task Diversity
CodeArena ships with 9 tasks across 5 categories:
| Category | Tasks | What It Tests |
|---|---|---|
| Easy (syntax) | Missing colons, wrong indentation | Basic Python syntax repair |
| Medium (logic) | Off-by-one errors, wrong conditions | Algorithmic reasoning |
| Hard (optimization) | O(n²) → O(n) refactoring | Algorithm design |
| Type Errors | Wrong types, missing conversions | Type system understanding |
| Security Bugs | SQL injection, path traversal | Security awareness |
Each task includes:
- Buggy source code
- Multiple unit tests
- An optimal execution time baseline (for efficiency scoring)
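To make those three ingredients concrete, here is a hypothetical task entry (field names are illustrative, mirroring the list above), plus the AST pre-check that catches the bug before anything executes:

```python
import ast

# Hypothetical task entry; the actual task file format may differ
example_task = {
    "task_id": "easy_missing_colon",
    "category": "easy",
    "buggy_code": "def add(a, b)\n    return a + b\n",   # missing colon
    "tests": [
        "assert add(1, 2) == 3",
        "assert add(-1, 1) == 0",
    ],
    "optimal_time_s": 0.01,    # baseline used for efficiency scoring
}

def has_valid_syntax(code):
    """AST-based syntax validation, done without running the code."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```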
Training Pipeline: TRL GRPO on CodeArena
We trained a coding model using Hugging Face TRL's GRPO (Group Relative Policy Optimization) trainer, connecting it directly to the CodeArena environment as a live reward signal.
How It Works
```python
import httpx
from trl import GRPOConfig, GRPOTrainer

# The reward function queries CodeArena's /step endpoint
def codearena_reward_func(completions, **kwargs):
    rewards = []
    for completion in completions:
        proposed_fix = completion[0].get("content", "").strip()
        res = httpx.post(
            "http://localhost:7860/step",
            json={"proposed_fix": proposed_fix},
        )
        rewards.append(res.json().get("reward", 0.0))
    return rewards

# GRPO training with CodeArena as the reward environment
trainer = GRPOTrainer(
    model=model,
    reward_funcs=codearena_reward_func,
    args=GRPOConfig(
        output_dir="./codearena-grpo",
        learning_rate=1e-5,
        max_steps=50,
        per_device_train_batch_size=2,
    ),
    train_dataset=dataset,
)
trainer.train()
```
The key insight is that the reward is not static: it comes from actually executing the agent's proposed code against real unit tests in a sandboxed environment, then grading it with the hybrid scorer. This is true environment-in-the-loop RL, not reward modeling on a frozen dataset.
Training Results
We trained Qwen/Qwen2.5-Coder-1.5B on the m-a-p/Code-Feedback dataset with CodeArena as the reward environment.

Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.

Fig 2: Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging: exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.

Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.

Fig 4: Complexity Distribution highlighting the frequency of O(1) vs O(n) solutions generated by the agent.

Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.

Fig 6: Cumulative Reward over time, highlighting the total accumulated reward across multiple episodes.

Fig 7: LLM Fixer Method Performance Comparison scatter plot showing the individual performance data points of Ollama vs Builtin methods.
Reproducing the Training
The complete training pipeline is available as a Colab notebook:
Open in Google Colab
The notebook:
- Installs all dependencies (`trl`, `transformers`, `httpx`)
- Clones the CodeArena repository
- Starts the FastAPI backend server
- Loads `Qwen2.5-Coder-1.5B` with GRPO configuration
- Trains against the live environment
- Logs rewards per step
Live Demo: Try It Now
The fully-functional CodeArena environment is deployed on Hugging Face Spaces with a React frontend dashboard:
https://huggingface.co/spaces/ceoavinash/codearena-rl
What You Can Do on the Live Demo:
- Start an Episode: Select Easy/Medium/Hard difficulty and load a buggy code task
- Manual Fix: Edit the code yourself and click "Run Step" to see your reward
- AI Fix: Click the AI FIX button to have the built-in AI repair agent (powered by `Qwen2.5-Coder-3B-Instruct` via HuggingFace Serverless Inference) generate a fix
- Agent Mode: Toggle auto-pilot to watch the agent autonomously fix → test → fix → test in a loop
- Sandbox Mode: Paste your own arbitrary Python code and watch the environment evaluate it
The dashboard shows real-time reward components (compile score, test ratio, efficiency), a terminal log of every step, and a reward chart that updates live.
Technical Deep Dive
Sandboxed Execution
All agent-submitted code runs in an isolated subprocess with:
- AST syntax validation before execution (catches syntax errors without running code)
- Timeout enforcement (configurable per task, default 5s)
- Temporary file execution (code is written to a temp file, executed, then deleted)
- Structured output parsing (test results are communicated via a `|CODEARENA_STATS|` sentinel)
This ensures that malicious or infinite-loop code cannot crash the server.
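The first three safeguards fit in a short function. A sketch, assuming the described behavior (AST check, temp file, subprocess, timeout); the return-value shape here is illustrative:

```python
import ast
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code, timeout=5):
    """AST-validate, then execute in an isolated subprocess with a timeout."""
    try:
        ast.parse(code)                   # catch syntax errors without running
    except SyntaxError as e:
        return {"ok": False, "error": f"SyntaxError: {e}"}
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"ok": proc.returncode == 0,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout"}
    finally:
        os.unlink(path)                   # temp file deleted even on timeout
```

A subprocess plus timeout contains infinite loops and crashes; stronger isolation (containers, seccomp) would be needed against genuinely adversarial code.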
AI Code Fixer Pipeline
The built-in AI fixer (`server/ai_fixer.py`, 600+ lines) implements a multi-fallback pipeline:
1. TGI / HuggingFace Serverless API (Priority 1): Calls `Qwen2.5-Coder-3B-Instruct` for high-quality fixes
2. Local Ollama (Priority 2): Falls back to a local LLM if available
3. AST Pattern-Based Fixer (Priority 3): 20+ pattern rules for common Python bugs:
   - Missing colons after `def`, `if`, `for`, `while`
   - Missing `return` statements
   - Wrong comparison operators (`=` → `==`)
   - Missing `self` parameter in class methods
   - Incorrect indentation repair
   - And many more
The fixer also includes a code validator that catches fixes worse than the original (e.g., introduces new syntax errors), and a self-critique loop that re-checks the generated code before returning it.
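To make the pattern tier and the validator concrete, here is a sketch of one rule (the missing-colon repair) plus a guard that rejects a fix which still fails to parse. These are illustrative re-implementations, not the actual `server/ai_fixer.py` code:

```python
import ast
import re

# Block headers that must end in ':' (a line containing no colon is suspect)
COLON_HEADERS = re.compile(
    r"^\s*(def|if|elif|else|for|while|class|try|except|finally)\b[^:]*$"
)

def fix_missing_colons(code):
    """One pattern rule: append ':' to block-header lines that lack one."""
    out = []
    for line in code.splitlines():
        if COLON_HEADERS.match(line):
            line = line.rstrip() + ":"
        out.append(line)
    return "\n".join(out)

def validated_fix(original, fixer):
    """Validator sketch: never return a fix that fails AST parsing."""
    candidate = fixer(original)
    try:
        ast.parse(candidate)
        return candidate
    except SyntaxError:
        return original
```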
Complexity-Reward Tracking
Every fix is logged to complexity_rewards.csv with:
- Task ID
- Reward achieved
- Detected time complexity
- Fix method (TGI/Ollama/built-in)
This creates a research dataset that proves our core hypothesis: agents that produce O(n) solutions consistently receive higher rewards than those producing O(n²) solutions.
Why CodeArena Matters
Writing code is a solved problem. GPT-4, Claude, Gemini: they can all generate working functions from natural language descriptions.
Debugging code autonomously (reasoning about failure, iterating on fixes, recovering from wrong turns) is not solved.
Every production coding system will eventually face broken code. There is no other standardized RL environment that trains and benchmarks iterative repair at this level. CodeArena fills that gap with:
- A hybrid grader that prevents reward-hacking
- An adaptive curriculum for continuous self-improvement
- A persistent memory for cross-episode learning
- A rich task library spanning syntax, logic, algorithms, types, and security
- Full OpenEnv compatibility for plug-and-play evaluation
CodeArena is infrastructure. Plug any model in. Run it. Get a number. Compare it against the baseline. Train on it. Watch it improve.
Links & Resources
| Resource | Link |
|---|---|
| Live Demo (HF Space) | huggingface.co/spaces/ceoavinash/codearena-rl |
| Training Notebook (Colab) | Open in Colab |
| Source Code (GitHub) | github.com/havinashpatil/meta |
| OpenEnv Manifest | openenv.yaml |
| Youtube | youtube.com |
Built for the OpenEnv Hackathon India 2026, Theme #4: Self-Improvement