| name: AgentDebuggerEnv |
| version: "1.0.0" |
| description: > |
| An OpenEnv-compliant RL training environment where LLM agents learn to debug |
| Python code through structured, multi-turn, hypothesis-driven reasoning. |
| The agent forms hypotheses, tests them, and refines them iteratively over up to 5 turns. |
| Trained via GRPO on Qwen2.5-Coder-7B-Instruct with curriculum learning across |
| three bug difficulty tiers. Reward design follows the execution-based and |
| process-based taxonomy of Masud et al. (2026) and the potential-based shaping |
| of Ibrahim et al. (2024). |
| domain: software_engineering |
| tags: |
| - openenv |
| - debugging |
| - reinforcement-learning |
| - grpo |
| - curriculum-learning |
| - python |
| - code-reasoning |
| - hypothesis-driven |
| - agentic-reasoning |
| - code-repair |
| - software-engineering |
| observation_type: structured |
| action_type: structured |
| reward_type: dense |
| episode_termination: action_or_step_limit |
| observation_space: |
| type: object |
| properties: |
| buggy_code: |
| type: string |
| description: The Python function containing the bug |
| error_message: |
| type: string |
| description: Error output or test failure description seen at episode start |
| test_results: |
| type: object |
| description: Results of running the current test suite |
| turn_number: |
| type: integer |
| description: Current turn within episode (0-indexed, max 4) |
| history: |
| type: array |
| description: Previous turns with agent outputs and rewards |
| action_space: |
| type: object |
| properties: |
| structured_response: |
| type: string |
| description: > |
| Agent response in the required format: |
| OBSERVATION: [text] |
| HYPOTHESIS: [text] |
| CONFIDENCE: [low|medium|high] |
| ACTION: [inspect_lines|run_tests|propose_fix|request_context|give_up] |
| DETAIL: [text] |
| reward_range: [-0.5, 1.0] |
| max_episode_steps: 5 |
| inference_script: inference.py |
| tasks: |
| - id: easy |
| name: Single Function Off-By-One Bug |
| difficulty: easy |
| max_attempts: 5 |
| max_steps: 8 |
| tests_total: 8 |
| description: > |
| Binary search with an off-by-one termination condition. |
| The error message is clear; 1-2 iterations are expected. |
| - id: medium |
| name: Red Herring — Interdependent Function Bug |
| difficulty: medium |
| max_attempts: 7 |
| max_steps: 15 |
| tests_total: 10 |
| description: > |
| Authentication module where the error points to the wrong function. |
| Agent must trace data flow backwards from symptom to root cause. |
| - id: hard |
| name: Concurrency Race Condition |
| difficulty: hard |
| max_attempts: 10 |
| max_steps: 25 |
| tests_total: 8 |
| description: > |
| Thread-safe counter with a race condition invisible to sequential tests. |
| Agent must design a concurrent test to surface the bug, then fix it. |
| baseline: |
| model: meta-llama/Llama-3.1-70B-Instruct |
| script: inference.py |
| mean_score: 0.51 |
| scores: |
| easy: 0.85 |
| medium: 0.50 |
| hard: 0.18 |
| author: "Shashaank (GitHub: @shasshaank, HF: @shashaank0707)" |
| license: MIT |
| huggingface_space: shashaank0707/AgentDebugger-env |
| api_base_url_env_var: API_BASE_URL |
| model_name_env_var: MODEL_NAME |
| hf_token_env_var: HF_TOKEN |
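The `structured_response` format listed under `action_space` can be checked before an action is acted on. The following is a minimal sketch, not the environment's actual parser: the function name `parse_structured_response`, the requirement that the five fields appear exactly once and in the listed order, and the case-insensitive handling of `CONFIDENCE` are all assumptions made for illustration.

```python
import re
from typing import Dict

# Field names and allowed values are taken from the action_space description.
FIELDS = ("OBSERVATION", "HYPOTHESIS", "CONFIDENCE", "ACTION", "DETAIL")
CONFIDENCE_LEVELS = {"low", "medium", "high"}
ACTIONS = {"inspect_lines", "run_tests", "propose_fix", "request_context", "give_up"}

# Matches a field label at the start of a line, e.g. "HYPOTHESIS: ".
_FIELD_RE = re.compile(r"^(%s):[ \t]*" % "|".join(FIELDS), re.MULTILINE)


def parse_structured_response(text: str) -> Dict[str, str]:
    """Split a response into its five fields (hypothetical helper).

    Assumption: every field appears exactly once, in the listed order.
    Raises ValueError if the layout or an enumerated value is invalid.
    """
    matches = list(_FIELD_RE.finditer(text))
    found = [m.group(1) for m in matches]
    if found != list(FIELDS):
        raise ValueError(f"expected fields {list(FIELDS)}, found {found}")

    parsed: Dict[str, str] = {}
    for i, match in enumerate(matches):
        # A field's value runs until the next field label or the end of text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parsed[match.group(1)] = text[match.end():end].strip()

    # Assumption: confidence is compared case-insensitively.
    parsed["CONFIDENCE"] = parsed["CONFIDENCE"].lower()
    if parsed["CONFIDENCE"] not in CONFIDENCE_LEVELS:
        raise ValueError(f"invalid CONFIDENCE: {parsed['CONFIDENCE']!r}")
    if parsed["ACTION"] not in ACTIONS:
        raise ValueError(f"invalid ACTION: {parsed['ACTION']!r}")
    return parsed


if __name__ == "__main__":
    sample = (
        "OBSERVATION: The loop exits before checking the last index.\n"
        "HYPOTHESIS: The termination condition is off by one.\n"
        "CONFIDENCE: high\n"
        "ACTION: propose_fix\n"
        "DETAIL: Change `while low < high` to `while low <= high`."
    )
    print(parse_structured_response(sample)["ACTION"])
```

A strict parser like this rejects a response whose `DETAIL` text happens to contain a line starting with one of the field labels; a more lenient variant could treat everything after `DETAIL:` as free text.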