Spaces:

agentDebugger
/

AgentDebugger-training-v3

Running

AgentDebugger-training-v3 / openenv.yaml

shank

Update: Started making changes for the hackathon

a55c81d 9 days ago

3.21 kB

	name: AgentDebuggerEnv
	version: "1.0.0"
	description: >
	An OpenEnv-compliant RL training environment where LLM agents learn to debug
	Python code through structured multi-turn hypothesis-driven reasoning.
	The agent forms hypotheses, tests them, and refines iteratively over up to 5 turns.
	Trained via GRPO on Qwen2.5-Coder-7B-Instruct with curriculum learning across
	3 bug difficulty tiers. Reward design follows Masud et al. (2026) execution-based
	+ process-based taxonomy and Ibrahim et al. (2024) potential-based shaping.
	domain: software_engineering
	tags:
	- openenv
	- debugging
	- reinforcement-learning
	- grpo
	- curriculum-learning
	- python
	- code-reasoning
	- hypothesis-driven
	- agentic-reasoning
	- code-repair
	- software-engineering
	observation_type: structured
	action_type: structured
	reward_type: dense
	episode_termination: action_or_step_limit
	observation_space:
	type: object
	properties:
	buggy_code:
	type: string
	description: The Python function containing the bug
	error_message:
	type: string
	description: Error output or test failure description seen at episode start
	test_results:
	type: object
	description: Results of running current test suite
	turn_number:
	type: integer
	description: Current turn within episode (0-indexed, max 4)
	history:
	type: array
	description: Previous turns with agent outputs and rewards
	action_space:
	type: object
	properties:
	structured_response:
	type: string
	description: >
	Agent response in required format:
	OBSERVATION: [text]
	HYPOTHESIS: [text]
	CONFIDENCE: [low\|medium\|high]
	ACTION: [inspect_lines\|run_tests\|propose_fix\|request_context\|give_up]
	DETAIL: [text]
	reward_range: [-0.5, 1.0]
	max_episode_steps: 5
	inference_script: inference.py
	tasks:
	- id: easy
	name: Single Function Off-By-One Bug
	difficulty: easy
	max_attempts: 5
	max_steps: 8
	tests_total: 8
	description: >
	Binary search with an off-by-one termination condition.
	Clear error message, 1-2 iterations expected.
	- id: medium
	name: Red Herring — Interdependent Function Bug
	difficulty: medium
	max_attempts: 7
	max_steps: 15
	tests_total: 10
	description: >
	Authentication module where error points to the wrong function.
	Agent must trace data flow backwards from symptom to root cause.
	- id: hard
	name: Concurrency Race Condition
	difficulty: hard
	max_attempts: 10
	max_steps: 25
	tests_total: 8
	description: >
	Thread-safe counter with a race condition invisible to sequential tests.
	Agent must design a concurrent test to surface the bug, then fix it.
	baseline:
	model: meta-llama/Llama-3.1-70B-Instruct
	script: inference.py
	mean_score: 0.51
	scores:
	easy: 0.85
	medium: 0.50
	hard: 0.18
	author: "Shashaank (GitHub: @shasshaank, HF: @shashaank0707)"
	# Submission Integrity: SHA 5c507c313ff2c209d7b770af6f08cf6ed6ab1568 \| Verified 2026-04-09
	license: MIT
	huggingface_space: shashaank0707/AgentDebugger-env
	api_base_url_env_var: API_BASE_URL
	model_name_env_var: MODEL_NAME
	hf_token_env_var: HF_TOKEN