CodeArena: Teaching LLMs to Debug Code Through Reinforcement Learning


An OpenEnv-compatible RL environment for iterative code repair with adaptive difficulty, hybrid grading, and self-improving agent memory.



The Problem: Why We Built CodeArena

Every major AI coding assistant (GitHub Copilot, Cursor, Devin) is benchmarked on code generation. Can it write a function? Can it complete a snippet?

But here's the gap nobody is talking about: what happens when the code breaks?

In production, code breaks constantly. A real developer doesn't just generate code; they spend the majority of their time reading error logs, reasoning about failure, iterating on fixes, and recovering from mistakes. This iterative debugging loop is the core skill that separates a junior developer from a senior one.

Yet there is no standardized RL environment to train or evaluate an LLM on this capability. HumanEval measures one-shot generation. MBPP measures function completion. Neither measures what happens across multiple repair attempts when the first fix doesn't work.

CodeArena is the first open-source, OpenEnv-compatible reinforcement learning environment built specifically for iterative code repair.


How CodeArena Works

The Loop

CodeArena simulates the real-world debugging workflow:

1. Agent receives buggy Python code + error log
2. Agent proposes a fix
3. Environment executes the fix in a sandboxed subprocess
4. Environment runs unit tests and scores the fix
5. Agent receives reward + updated error log
6. Repeat up to 5 steps

This is fundamentally different from one-shot code generation benchmarks. The agent must:

  • Read and interpret error messages from previous attempts
  • Track what it has already tried (repeated fixes are penalized)
  • Decide whether to patch locally or rewrite entirely
  • Optimize for efficiency, not just correctness

Architecture

Agent ─── POST /reset ──→ CodeArena Server ──→ Returns buggy_code + error_log
  │                            │
  │                            ├── Task Loader (9 tasks across 5 categories)
  │                            ├── Sandboxed Executor (subprocess + timeout)
  │                            ├── Hybrid Grader (tests + LLM judge)
  │                            ├── Algorithm Detector (complexity analysis)
  │                            └── Agent Memory (self-improving store)
  │
  └── POST /step ─────────→ Returns observation, reward, done, info

The server is a standard FastAPI application that implements the OpenEnv specification (/reset, /step, /state). The openenv.yaml manifest defines the observation space (buggy code, error log, test results, previous attempts) and the action space (proposed fix).
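
For reference, a minimal client interaction against those endpoints might look like the sketch below. It assumes a locally running server on port 7860 and that the observation exposes buggy_code and error_log fields as described above; the propose_fix helper and the exact response field names are illustrative stand-ins, not the server's literal schema.

import httpx

BASE_URL = "http://localhost:7860"  # assumes a locally running CodeArena server

def propose_fix(buggy_code: str, error_log: str) -> str:
    # Placeholder agent: plug in an LLM (or a human) that returns repaired code.
    return buggy_code

# task_id "auto" turns on the adaptive curriculum described later in this post.
resp = httpx.post(f"{BASE_URL}/reset", json={"task_id": "auto"}).json()
obs = resp.get("observation", resp)  # response shape here is illustrative

for step in range(5):  # episodes run for at most 5 steps
    fix = propose_fix(obs.get("buggy_code", ""), obs.get("error_log", ""))
    resp = httpx.post(f"{BASE_URL}/step", json={"proposed_fix": fix}).json()
    print(f"step {step}: reward={resp.get('reward', 0.0):.2f}")
    if resp.get("done"):
        break
    obs = resp.get("observation", obs)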


What Makes CodeArena Special (Environment Innovation)

1. Hybrid Grader: Tests + LLM-as-Judge

Most coding benchmarks use a single signal: did the tests pass? This creates a fundamental problem: agents learn to produce code that passes weak tests through reward-hacking (e.g., hardcoding expected outputs, or producing syntactically correct but semantically broken code).

CodeArena uses a Hybrid Grader with six weighted components:

| Component | Weight | What It Measures |
|---|---|---|
| compile_score | 15% | Code compiles without syntax errors |
| test_pass_ratio | 35% | Fraction of unit tests passed |
| efficiency_score | 30% | Execution time vs. optimal runtime |
| llm_correctness | 10% | LLM judge: is the fix logically correct? |
| llm_security | 5% | LLM judge: does the fix introduce vulnerabilities? |
| llm_quality | 5% | LLM judge: is the code readable and maintainable? |

Additionally, two penalties are applied:

  • Step penalty (-0.01 × step_count): Rewards faster fixes
  • Novelty penalty (-0.10): Penalizes submitting the same fix twice

The LLM judge is called via the OpenAI-compatible API (configurable to GPT-4o-mini, local Ollama, or HuggingFace Inference). When no API key is available, it falls back to neutral scores (0.5), ensuring the environment always runs.
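
A rough sketch of how those components could combine into a single reward is shown below. The weights and penalties come from the tables above, but the function name, the efficiency ratio (optimal time divided by actual time, capped at 1.0), and the neutral 0.5 fallback wiring are illustrative assumptions rather than the grader's exact code.

def hybrid_reward(scores: dict, step_count: int, is_repeat_fix: bool,
                  optimal_time: float, actual_time: float) -> float:
    # Weighted sum of the six grader components (weights from the table above).
    weights = {
        "compile_score": 0.15,
        "test_pass_ratio": 0.35,
        "efficiency_score": 0.30,
        "llm_correctness": 0.10,
        "llm_security": 0.05,
        "llm_quality": 0.05,
    }
    scores = dict(scores)
    # Assumed efficiency definition: optimal runtime over actual runtime, capped at 1.0.
    scores["efficiency_score"] = min(1.0, optimal_time / max(actual_time, 1e-9))
    # Missing LLM-judge scores fall back to the neutral 0.5 described above.
    total = sum(weights[k] * scores.get(k, 0.5) for k in weights)
    total -= 0.01 * step_count      # step penalty: rewards faster fixes
    if is_repeat_fix:
        total -= 0.10               # novelty penalty: same fix submitted twice
    return total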

Why this matters for training: The heavy 30% weight on efficiency means that an agent that passes all tests with an O(n²) brute-force solution gets a significantly lower reward than one that uses an O(n) algorithm. This forces the model to learn algorithmic reasoning, not just syntax repair.

2. Adaptive Curriculum (Theme #4: Self-Improvement)

CodeArena doesn't use a fixed task set. It features an Adaptive Curriculum that tracks the agent's rolling average reward over recent episodes and automatically adjusts difficulty:

| Condition | Transition |
|---|---|
| avg reward > 0.80 on Easy | → Medium |
| avg reward > 0.75 on Medium | → Hard |
| avg reward < 0.35 on Hard | → Medium (de-escalate) |
| avg reward < 0.35 on Medium | → Easy (de-escalate) |

This is activated by passing task_id: "auto" to the /reset endpoint.
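
The transition rule itself is simple enough to sketch. The thresholds below come straight from the table above; the function name and the rolling-window bookkeeping are assumptions for illustration.

from collections import deque
from statistics import mean

LEVELS = ["easy", "medium", "hard"]

def next_difficulty(current: str, recent_rewards: deque) -> str:
    # Promote or demote based on the rolling average reward over recent episodes.
    if not recent_rewards:
        return current
    avg = mean(recent_rewards)
    idx = LEVELS.index(current)
    if current == "easy" and avg > 0.80:
        idx += 1
    elif current == "medium" and avg > 0.75:
        idx += 1
    elif current in ("medium", "hard") and avg < 0.35:
        idx -= 1  # de-escalate when the agent is struggling
    return LEVELS[max(0, min(idx, len(LEVELS) - 1))]

# Example: rolling window over the last 10 episodes
rewards = deque([0.82, 0.79, 0.85], maxlen=10)
print(next_difficulty("easy", rewards))  # -> "medium"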

Why this matters: The agent cannot plateau by memorizing solutions to easy tasks. As soon as it masters syntax errors, the environment pushes it to algorithmic logic bugs. If it struggles, it recovers on easier tasks before trying again. This creates a natural recursive skill amplification loop: the environment drives the agent's own capability growth.

3. Algorithm Detection + Adaptive Prompting

CodeArena includes a built-in Algorithm Detector (server/algorithm_detector.py) that:

  1. Classifies the problem type (max subarray, two-sum, binary search, sliding window, etc.) from code patterns
  2. Estimates time complexity by analyzing loop nesting depth (O(1) → O(n) → O(n²) → O(n³))
  3. Generates targeted optimization hints (e.g., "Use Kadane's Algorithm O(n): curr = max(num, curr+num)")

When the AI fixer generates a repair, the algorithm detector provides adaptive prompt suffixes based on the current reward level (a sketch follows the list below):

  • Low reward (< 0.4): "Focus on correctness. Fix syntax errors first."
  • Medium reward (0.4–0.7): "Fix edge cases and logic bugs."
  • High reward (> 0.7): "Optimize for performance. Use O(n) algorithms."
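
A minimal sketch of both ideas, assuming a nesting-depth heuristic for complexity estimation and the reward thresholds listed above; the real server/algorithm_detector.py is more involved, so treat this as an illustration of the approach rather than its implementation.

import ast

def estimate_complexity(code: str) -> str:
    # Heuristic: the deepest nesting of for/while loops maps to O(1), O(n), O(n^2), O(n^3)+.
    tree = ast.parse(code)

    def loop_depth(node, depth=0):
        deepest = depth
        for child in ast.iter_child_nodes(node):
            bump = 1 if isinstance(child, (ast.For, ast.While)) else 0
            deepest = max(deepest, loop_depth(child, depth + bump))
        return deepest

    return {0: "O(1)", 1: "O(n)", 2: "O(n^2)"}.get(loop_depth(tree), "O(n^3)")

def prompt_suffix(current_reward: float) -> str:
    # Reward-dependent hint appended to the fixer's prompt.
    if current_reward < 0.4:
        return "Focus on correctness. Fix syntax errors first."
    if current_reward <= 0.7:
        return "Fix edge cases and logic bugs."
    return "Optimize for performance. Use O(n) algorithms."

# Example: a doubly nested loop is flagged as quadratic
print(estimate_complexity("for i in a:\n    for j in a:\n        pass"))  # O(n^2)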

4. Self-Improving Agent Memory

CodeArena includes a persistent Agent Memory system (server/memory.py) that stores the best solution found for each task. When the agent encounters the same task type again, it can retrieve its previous best solution as a starting point.

This creates a genuine self-improvement loop:

  • Episode 1: Agent fixes syntax → reward 0.45
  • Episode 5: Agent recalls its best previous fix, optimizes further → reward 0.72
  • Episode 10: Agent has accumulated enough memory to skip basic fixes entirely → reward 0.88

The memory is persisted to agent_memory.json and survives server restarts.
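
A minimal sketch of such a store, assuming a flat JSON file keyed by task ID (the real server/memory.py may organize entries differently):

import json
from pathlib import Path

MEMORY_PATH = Path("agent_memory.json")  # same file the server persists to

def load_memory() -> dict:
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}

def remember_best(task_id: str, fix: str, reward: float) -> None:
    # Keep only the highest-reward fix seen so far for each task.
    memory = load_memory()
    best = memory.get(task_id)
    if best is None or reward > best["reward"]:
        memory[task_id] = {"fix": fix, "reward": reward}
        MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def recall_best(task_id: str) -> str | None:
    # Return the previous best fix as a starting point, if one exists.
    entry = load_memory().get(task_id)
    return entry["fix"] if entry else None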

5. Rich Task Diversity

CodeArena ships with 9 tasks across 5 categories:

| Category | Tasks | What It Tests |
|---|---|---|
| Easy (syntax) | Missing colons, wrong indentation | Basic Python syntax repair |
| Medium (logic) | Off-by-one errors, wrong conditions | Algorithmic reasoning |
| Hard (optimization) | O(n²) → O(n) refactoring | Algorithm design |
| Type Errors | Wrong types, missing conversions | Type system understanding |
| Security Bugs | SQL injection, path traversal | Security awareness |

Each task includes (an illustrative entry is sketched after this list):

  • Buggy source code
  • Multiple unit tests
  • An optimal execution time baseline (for efficiency scoring)
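
For a sense of what one entry might hold, here is a hypothetical task definition; the field names are assumptions based on the description above, not the loader's exact schema.

# Hypothetical task entry; the actual task loader schema may differ.
example_task = {
    "task_id": "medium_off_by_one",
    "category": "logic",
    "difficulty": "medium",
    "buggy_code": (
        "def sum_first_n(nums, n):\n"
        "    total = 0\n"
        "    for i in range(n - 1):   # off-by-one: should be range(n)\n"
        "        total += nums[i]\n"
        "    return total\n"
    ),
    "tests": [
        {"input": {"nums": [1, 2, 3, 4], "n": 3}, "expected": 6},
        {"input": {"nums": [5], "n": 1}, "expected": 5},
    ],
    "optimal_time_seconds": 0.01,  # baseline used for the efficiency score
}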

Training Pipeline: TRL GRPO on CodeArena

We trained a coding model using Hugging Face TRL's GRPO (Group Relative Policy Optimization) trainer, connecting it directly to the CodeArena environment as a live reward signal.

How It Works

import httpx
from trl import GRPOConfig, GRPOTrainer

# The reward function queries CodeArena's /step endpoint
def codearena_reward_func(completions, prompts, **kwargs):
    rewards = []
    for completion in completions:
        # Completions arrive in conversational format: a list of {"role", "content"} messages.
        proposed_fix = completion[0].get('content', '').strip()
        res = httpx.post("http://localhost:7860/step",
                         json={"proposed_fix": proposed_fix})
        rewards.append(res.json().get('reward', 0.0))
    return rewards

# GRPO training with CodeArena as the reward environment
trainer = GRPOTrainer(
    model=model,
    reward_funcs=codearena_reward_func,
    args=GRPOConfig(
        output_dir="./codearena-grpo",
        learning_rate=1e-5,
        max_steps=50,
        per_device_train_batch_size=2,
    ),
    train_dataset=dataset,
)
trainer.train()

The key insight is that the reward is not static: it comes from actually executing the agent's proposed code against real unit tests in a sandboxed environment, then grading it with the hybrid scorer. This is true environment-in-the-loop RL, not reward modeling on a frozen dataset.

Training Results

We trained Qwen/Qwen2.5-Coder-1.5B on the m-a-p/Code-Feedback dataset with CodeArena as the reward environment.

Fig 1: Episode reward over training steps. The rolling 10-step average shows clear learning and improvement from initial near-zero rewards to consistent 0.65+ rewards.

Fig 2: Average reward broken down by task category. The agent learned to handle syntax and type errors reliably, while algorithmic optimization tasks remain challenging, exactly the behavior we'd expect from a curriculum that pushes harder problems as the agent improves.

Fig 3: Task Difficulty Performance Matrix showing the mean, max, and standard deviation of rewards across difficulty levels.

Fig 4: Complexity Distribution highlighting the frequency of O(1) vs. O(n) solutions generated by the agent.

Fig 5: Reward Distribution by Fixer Method, comparing the performance of the Ollama LLM to the built-in pattern-based fixer.

Fig 6: Cumulative Reward over time, highlighting the total accumulated reward across multiple episodes.

Fig 7: LLM Fixer Method Performance Comparison scatter plot showing the individual performance data points of the Ollama vs. built-in methods.

Reproducing the Training

The complete training pipeline is available as a Colab notebook:
👉 Open in Google Colab

The notebook:

  1. Installs all dependencies (trl, transformers, httpx)
  2. Clones the CodeArena repository
  3. Starts the FastAPI backend server
  4. Loads Qwen2.5-Coder-1.5B with GRPO configuration
  5. Trains against the live environment
  6. Logs rewards per step

Live Demo: Try It Now

The fully-functional CodeArena environment is deployed on Hugging Face Spaces with a React frontend dashboard:

👉 https://huggingface.co/spaces/ceoavinash/codearena-rl

What You Can Do on the Live Demo:

  1. Start an Episode: Select Easy/Medium/Hard difficulty and load a buggy code task
  2. Manual Fix: Edit the code yourself and click "Run Step" to see your reward
  3. AI Fix: Click the ✨ AI FIX button to have the built-in AI repair agent (powered by Qwen2.5-Coder-3B-Instruct via HuggingFace Serverless Inference) generate a fix
  4. Agent Mode: Toggle auto-pilot to watch the agent autonomously fix → test → fix → test in a loop
  5. Sandbox Mode: Paste your own arbitrary Python code and watch the environment evaluate it

The dashboard shows real-time reward components (compile score, test ratio, efficiency), a terminal log of every step, and a reward chart that updates live.


Technical Deep Dive

Sandboxed Execution

All agent-submitted code runs in an isolated subprocess with:

  • AST syntax validation before execution (catches syntax errors without running code)
  • Timeout enforcement (configurable per task, default 5s)
  • Temporary file execution (code is written to a temp file, executed, then deleted)
  • Structured output parsing (test results are communicated via a |CODEARENA_STATS| sentinel)

This ensures that malicious or infinite-loop code cannot crash the server.
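
A stripped-down sketch of that execution path is shown below, assuming the documented defaults (AST validation first, a 5-second timeout, and a temp file that is always cleaned up). Parsing of the |CODEARENA_STATS| sentinel is omitted, and the function name and return shape are illustrative.

import ast
import subprocess
import sys
import tempfile
from pathlib import Path

def run_sandboxed(code: str, timeout: float = 5.0) -> dict:
    # 1. AST validation catches syntax errors without ever executing the code.
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return {"ok": False, "stage": "syntax", "error": str(exc)}

    # 2. Write to a temp file and run it in a separate Python subprocess with a timeout.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
        tmp.write(code)
        path = Path(tmp.name)
    try:
        proc = subprocess.run(
            [sys.executable, str(path)],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stage": "timeout", "error": f"exceeded {timeout}s"}
    finally:
        path.unlink(missing_ok=True)  # 3. Always delete the temp file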

AI Code Fixer Pipeline

The built-in AI fixer (server/ai_fixer.py, 600+ lines) implements a sophisticated multi-fallback pipeline:

  1. TGI / HuggingFace Serverless API (Priority 1): Calls Qwen2.5-Coder-3B-Instruct for high-quality fixes
  2. Local Ollama (Priority 2): Falls back to a local LLM if available
  3. AST Pattern-Based Fixer (Priority 3): 20+ pattern rules for common Python bugs:
    • Missing colons after def, if, for, while
    • Missing return statements
    • Wrong comparison operators (= → ==)
    • Missing self parameter in class methods
    • Incorrect indentation repair
    • And many more

The fixer also includes a code validator that catches fixes worse than the original (e.g., introduces new syntax errors), and a self-critique loop that re-checks the generated code before returning it.
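
To make the pattern-rule idea concrete, here is a tiny illustrative rule for one case from the list above (missing colons), combined with the keep-only-if-it-parses validator idea. The real fixer has 20+ rules and works on the AST, so treat this line-based version as a simplified stand-in.

import ast
import re

# Python statement headers that must end with a colon.
HEADER = re.compile(r"^\s*(def |class |if |elif |for |while |with |else\b|try\b|except|finally\b)")

def fix_missing_colons(code: str) -> str:
    lines = []
    for line in code.splitlines():
        stripped = line.rstrip()
        # Simplified rule: single-line headers only; multi-line conditions are not handled.
        if HEADER.match(line) and stripped and not stripped.endswith(":"):
            line = stripped + ":"
        lines.append(line)
    fixed = "\n".join(lines)
    # Validator step: keep the rewrite only if it parses; otherwise return the original
    # rather than making the code worse.
    try:
        ast.parse(fixed)
        return fixed
    except SyntaxError:
        return code

# Example: a header missing its colon is repaired and the result still parses
print(fix_missing_colons("def add(a, b)\n    return a + b"))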

Complexity-Reward Tracking

Every fix is logged to complexity_rewards.csv with:

  • Task ID
  • Reward achieved
  • Detected time complexity
  • Fix method (TGI/Ollama/built-in)

This creates a research dataset that supports our core hypothesis: agents that produce O(n) solutions consistently receive higher rewards than those producing O(n²) solutions.
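
Appending to that log is straightforward; below is a sketch assuming one row per graded fix with the four fields listed above (the exact column names in complexity_rewards.csv may differ).

import csv
from pathlib import Path

LOG_PATH = Path("complexity_rewards.csv")

def log_fix(task_id: str, reward: float, complexity: str, method: str) -> None:
    # Append one row per graded fix; write the header only when creating the file.
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["task_id", "reward", "complexity", "method"])
        writer.writerow([task_id, f"{reward:.3f}", complexity, method])

# Example usage
log_fix("hard_max_subarray", 0.82, "O(n)", "ollama")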


Why CodeArena Matters

Writing code is a solved problem. GPT-4, Claude, and Gemini can all generate working functions from natural language descriptions.

Debugging code autonomously (reasoning about failure, iterating on fixes, recovering from wrong turns) is not solved.

Every production coding system will eventually face broken code. There is no other standardized RL environment that trains and benchmarks iterative repair at this level. CodeArena fills that gap with:

  • A hybrid grader that prevents reward-hacking
  • An adaptive curriculum for continuous self-improvement
  • A persistent memory for cross-episode learning
  • A rich task library spanning syntax, logic, algorithms, types, and security
  • Full OpenEnv compatibility for plug-and-play evaluation

CodeArena is infrastructure. Plug any model in. Run it. Get a number. Compare it against the baseline. Train on it. Watch it improve.


Links & Resources

| Resource | Link |
|---|---|
| 🤗 Live Demo (HF Space) | huggingface.co/spaces/ceoavinash/codearena-rl |
| 📓 Training Notebook (Colab) | Open in Colab |
| 💻 Source Code (GitHub) | github.com/havinashpatil/meta |
| 📋 OpenEnv Manifest | openenv.yaml |
| YouTube | youtube.com |

Built for the OpenEnv Hackathon India 2026, Theme #4: Self-Improvement
