shank committed on
Commit 4057375 · 1 Parent(s): e93446d

Update: Final README.md update

Files changed (1)
  1. README.md +301 -73
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
  title: AgentDebugger-Env 🐛
- emoji: 📈
- colorFrom: yellow
- colorTo: green
  sdk: docker
  app_port: 8000
  pinned: true
@@ -11,131 +11,359 @@ license: mit

  # AgentDebuggerEnv 🐛

- > **Benchmarking Agentic Reasoning through the Iterative Hypothesis-Test-Fix Loop.**

- **AgentDebuggerEnv** is an OpenEnv-compliant benchmarking environment designed for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**. Unlike static code-repair benchmarks that only measure the final output, AgentDebuggerEnv evaluates the *cognitive trajectory* of an agent: how it forms hypotheses, interprets execution failures, and iterates toward a solution in a secure, live sandbox.

  ---

- ## 📊 Baseline Performance

- Tested with **GPT-4o** using the standard `inference.py` script:

- - **Easy (0.85)**: Solved in 1-2 attempts; clear signal from error output.
- - **Medium (0.50)**: Solved in ~4 attempts; agents must resist a red-herring authentication error.
- - **Hard (0.18)**: Rarely solved; agents must proactively design concurrent tests to surface the hidden race condition.
- - **Mean Score: 0.51**

- *Measurements taken over multiple runs to account for LLM variance. See `openenv.yaml` for full metadata.*

  ---

- ## 🚀 The Core Philosophy

- Traditional benchmarks (like HumanEval or MBPP) are "one-shot": the model sees a prompt and writes code. Real-world engineering is **iterative**.

- AgentDebuggerEnv forces agents to operate in a **live feedback loop**:
- 1. **Observe**: Analyze existing buggy code and initial test failures.
- 2. **Hypothesize**: Explicitly state a theory about the root cause (scored for accuracy).
- 3. **Act**: Submit a surgical fix or query the environment for more context.
- 4. **Verify**: Observe real-time `stdout/stderr` from a sandboxed test suite execution.

  ---

- ## 🛠️ Technical Architecture

- ### 1. Robust Security Sandbox
- Every submission is executed in a multi-layered isolated environment:
- * **AST Filtering**: A deep Abstract Syntax Tree (AST) pass analyzes submitted code before execution, blocking dangerous imports (`os`, `sys`, `subprocess`, `socket`, etc.) and preventing the override of security-critical builtins.
- * **Process Isolation**: Executes in a separate subprocess with strict resource limits (CPU/Memory) enforced via container runtime and execution timeouts (15s). Any attempt to hang the environment results in immediate termination.
- * **Thread Safety**: A specialized "Concurrency Sandbox" allows multi-threaded tests (essential for the Hard Task) while maintaining strict host-level security boundaries.

- ### 2. High-Fidelity Feedback
- Instead of binary `Pass/Fail` bits, the environment returns the **raw execution stream**. This allows agents to:
- * Read stack traces.
- * See partial progress (e.g., "6 passed, 2 failed").
- * Detect timeouts and resource exhaustion.

  ---

- ## 📝 Task Suite & Reasoning Challenges

- | Task | Difficulty | Reasoning Challenge | Why it's hard |
- | :--- | :--- | :--- | :--- |
- | **Easy** | 🟢 Easy | **Off-by-One** | Requires basic logic verification. The error message is high-signal. |
- | **Medium** | 🟡 Medium | **Red Herring** | The symptom (MD5 hashing error) manifests far from the root cause. Agent must trace data flow backward. |
- | **Hard** | 🔴 Hard | **Race Condition** | **Invisible to sequential tests.** The agent must reason that passing tests do *not* mean the code is correct, and design a concurrent stress test. |

  ---

- ## 📊 Professional Grading Methodology

- Our graders don't just check if the code works at the end. They score the **process**:

- * **Sequential Correctness (40%)**: Does the fix pass the original unit tests?
- * **Hidden Strength (30%)**: Does the fix survive a high-concurrency (1000-thread) stress test? (Hard task only).
- * **Hypothesis Accuracy (20%)**: Did the agent correctly identify the bug? (NLP-based keyword matching against ground truth).
- * **Efficiency Bonus (10%)**: Did the agent solve it within 5 attempts?

  ---

- ## ⚙️ Installation & Usage

- ### 📦 Local Setup
  ```bash
- git clone https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
- cd AgentDebugger-env
- pip install -e .
  ```

- ### 🚢 Running the Environment
  ```bash
- # Start the FastAPI server
- uvicorn env.server:app --host 0.0.0.0 --port 8000
  ```

- ### 🤖 Running an Agent (OpenEnv Baseline)
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o"
- export HF_TOKEN="your_openai_key"
  export ENV_BASE_URL="http://localhost:8000"
  python inference.py
  ```

  ---

- ### 🔐 Environment Variables

- | Variable | Description | Standard Fallback |
- | :--- | :--- | :--- |
  | `API_BASE_URL` | LLM API endpoint | `https://api.openai.com/v1` |
- | `MODEL_NAME` | Model to evaluate | `gpt-4o` |
- | `HF_TOKEN` | Auth token (or OpenAI key) | — |
- | `OPENAI_API_KEY` | Alternative auth token | — |
- | `ENV_BASE_URL` | Address of the FastAPI server | `http://localhost:8000` |

  ---

- ## 🔗 OpenEnv API Compliance

- AgentDebuggerEnv implements the full OpenEnv specification:

- * `POST /reset`: Initialize a task (`{"task_id": "medium"}`).
- * `POST /step`: Submit an `Action` (supports `submit_fix`, `query_context`, `give_up`).
- * `GET /state`: Retrieve full episode history and current environment state.
- * `GET /health`: Standard health check for automated uptime monitoring.

  ---

- ## 📜 Metadata & License
- * **License**: [MIT](LICENSE)
- * **Author**: Shashaank (GitHub: @shasshaank, HF: @shashaank0707)
- * **Hackathon**: Meta + PyTorch + HuggingFace OpenEnv 2024

  ---

- ### ✅ Submission Integrity
- - **Commit SHA**: `159a5faf82fc1ab3709f9674becf9a3ec55cf562`
- - **Last Verified Sync**: 2026-04-08
- - **Platform Match**: GitHub and HF Space are identical at this HEAD.
  ---
  title: AgentDebugger-Env 🐛
+ emoji: 🐛
+ colorFrom: red
+ colorTo: orange
  sdk: docker
  app_port: 8000
  pinned: true

  # AgentDebuggerEnv 🐛

+ > **A live, iterative debugging environment for benchmarking genuine agentic reasoning in AI systems.**

+ [![HuggingFace Space](https://img.shields.io/badge/🤗%20Space-Live-yellow)](https://huggingface.co/spaces/shashaank0707/AgentDebugger-env)
+ [![OpenEnv](https://img.shields.io/badge/OpenEnv-Compliant-blue)](#openenv-api-compliance)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
+ [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue)](https://www.python.org/)
+
+ *Submitted to the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**.*

  ---

+ ## The Problem with Existing Code Benchmarks

+ Benchmarks like HumanEval, MBPP, and SWE-bench share a fundamental limitation: they are **one-shot**. A model reads a problem, generates code, and is scored on the final output. This measures code generation — not debugging ability.

+ Real software engineering is not one-shot. It is **iterative**. A developer reads failing tests, forms a hypothesis, submits a fix, reads the new error output, updates their theory, and repeats. No existing OpenEnv environment benchmarks this loop.

+ **AgentDebuggerEnv does.**

  ---

+ ## How It's Different from SWE-bench

+ | Dimension | SWE-bench | AgentDebuggerEnv |
+ |---|---|---|
+ | Evaluation target | Final patch correctness | Full reasoning trajectory |
+ | Feedback to agent | None — single shot | Real `stdout/stderr` after every attempt |
+ | Reward signal | Binary end-of-episode | Dense — every step scored |
+ | What's measured | Code generation | Hypothesis formation + iterative reasoning |
+ | Hard task | Apply patch to existing issue | Must design a test to surface a hidden bug |
+ | Agent failure modes | Not tracked | 4 distinct measurable failure modes |

+ The iterative feedback loop is the core mechanic. Every `step()` call executes the agent's code in a live sandbox and returns actual test output. The agent must update its theory and try again — exactly like a real developer at a terminal.

  ---

+ ## Baseline Performance
+
+ Evaluated using `gpt-4o` with zero-shot prompting. Each task was run 5 times independently and the scores averaged.
+
+ | Task | Difficulty | Mean Score | Std Dev | Solved % | Avg Attempts |
+ |---|---|---|---|---|---|
+ | Off-by-One Bug | 🟢 Easy | 0.85 | ±0.04 | 100% | 1.8 |
+ | Red Herring Auth Bug | 🟡 Medium | 0.50 | ±0.10 | 60% | 4.2 |
+ | Race Condition | 🔴 Hard | 0.18 | ±0.09 | 20% | 8.7 |
+ | **Overall Mean** | | **0.51** | | **60%** | |
+
+ The hard task is specifically designed so that frontier models fail most of the time. GPT-4o almost never spontaneously recognizes that a race condition can exist when all sequential tests pass — which is exactly the reasoning gap this environment is built to measure.
+
+ ---
+
+ ## The Four Agent Failure Modes This Environment Measures
+
+ These are real, documented failure modes in LLM agents. AgentDebuggerEnv makes all four measurable and independently scorable for the first time:
+
+ **1. Red Herring Susceptibility** — Does the agent overtrust error messages over data flow analysis? The medium task's error points directly to `authenticate_user`, which is completely correct. The bug is in `hash_password`. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually stumbles onto the right fix.
+
+ **2. Stagnation Under Uncertainty** — Does the agent repeat the same failed fix instead of updating its hypothesis? The `-0.05` stagnation penalty and `hypothesis_accuracy` score together capture this. Submitting the same code twice costs reward twice.
+
+ **3. Exploration vs. Exploitation** — The `query_context` action costs a step but provides information. The first query is free; subsequent queries cost `-0.05`. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that blindly submit fixes.
+
+ **4. Test-Suite as Sufficient Proof** — The hard task tests whether an agent knows when passing tests are not enough. All 8 sequential tests pass on the buggy code. An agent that sees this and concludes the code is correct — without reasoning about concurrency — scores at most 0.40 and fails the most important grader component (the concurrent stress test worth 0.30).
+
+ All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response, making this environment useful as a diagnostic tool, not just a benchmark.
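
To make that concrete, here is a purely illustrative sketch of what such an itemized breakdown could look like; the key names are assumptions, and only the scored components themselves come from the sections above and below.

```python
# Hypothetical Reward.breakdown payload; key names are illustrative assumptions,
# but each entry maps to a component described in this README.
example_breakdown = {
    "test_progress": 0.15 * (2 / 10),   # two newly passing tests out of ten
    "stagnation_penalty": -0.05,        # an identical resubmission (failure mode 2)
    "query_context_cost": -0.05,        # a second context query (failure mode 3)
    "hypothesis_accuracy": 0.0,         # blamed the red herring (failure mode 1)
    "concurrent_stress_test": 0.0,      # hard task only (failure mode 4)
}
```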
+
+ ---
+
+ ## Task Suite
+
+ ### 🟢 Task 1 — Easy: Off-by-One Bug
+
+ **Max attempts:** 5 | **Max steps:** 8 | **Tests:** 8
+
+ A binary search implementation with a single-character bug: the while loop uses `left < right` instead of `left <= right`. This causes the function to miss the target when it is the last element. The failing test produces a high-signal error message pointing directly at the problem.
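
For orientation, a minimal sketch of this bug class (not the task's actual source file):

```python
def binary_search(items, target):
    """Return the index of target in sorted items, or -1 if absent."""
    left, right = 0, len(items) - 1
    while left < right:               # BUG: should be `left <= right`
        mid = (left + right) // 2
        if items[mid] == target:
            return mid
        elif items[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1

# binary_search([1, 3, 5], 5) -> -1: the loop exits before the final
# candidate index is inspected, so the last element is never found.
```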
+
+ **Why it's easy:** The error message names the failing assertion with expected vs. actual values. Reading the while condition reveals the bug. 1–2 iterations expected.
+
+ **What the grader checks:** Did all 8 tests pass? Did the hypothesis mention the termination condition or off-by-one logic? Was it efficient?
+
+ ---
+
+ ### 🟡 Task 2 — Medium: Red Herring Authentication Bug
+
+ **Max attempts:** 7 | **Max steps:** 15 | **Tests:** 10 (6 pass, 4 fail on buggy code)
+
+ An authentication module with three interdependent functions: `hash_password`, `validate_password`, and `authenticate_user`. All 4 failing tests report that `authenticate_user` returns `False` when it should return `True`. But `authenticate_user` is completely correct. So is `validate_password`. The bug is in `hash_password`, which wraps the MD5 hex digest in `str(bytes(...))` — producing a `"b'...'"` prefix that makes the computed hash never match the stored hash.
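
The corruption is easy to reproduce in isolation; the snippet below illustrates the failure described above and is not the task's actual `hash_password`:

```python
import hashlib

digest = hashlib.md5("secret".encode()).hexdigest()   # a plain 32-character hex string
corrupted = str(bytes(digest, "utf-8"))               # what the buggy hash_password effectively does

print(corrupted.startswith("b'"))   # True: the stored value becomes "b'<hex>'"
print(digest == corrupted)          # False, so login comparisons can never match
```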
+
+ **The red herring:** Every surface reading of the error points to `authenticate_user`. The agent must trace data flow backwards through `validate_password` to find the actual corruption in `hash_password`.
+
+ **Red herring detection in grader:** A hypothesis mentioning only `authenticate_user` scores 0.0 for hypothesis accuracy. Correctly identifying `hash_password` with supporting detail scores 1.0. GPT-4o follows the red herring ~40% of the time.
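
As a rough sketch of how keyword-based hypothesis scoring of this kind can work (an assumption about the grader's internals, not a copy of `grader_medium.py`):

```python
CULPRIT = "hash_password"
SUPPORTING_DETAIL = ("bytes", "b'", "str(", "prefix", "encod")  # illustrative keyword list

def score_hypothesis(hypothesis: str) -> float:
    """Full credit only when the real culprit is named with supporting detail."""
    text = hypothesis.lower()
    if CULPRIT in text and any(keyword in text for keyword in SUPPORTING_DETAIL):
        return 1.0
    return 0.0  # a hypothesis blaming only authenticate_user follows the red herring
```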
+
+ ---
+
+ ### 🔴 Task 3 — Hard: Concurrency Race Condition
+
+ **Max attempts:** 10 | **Max steps:** 25 | **Tests:** 8 (ALL 8 pass on the buggy code)
+
+ A `ConnectionCounter` class used in a web server to track active connections. It uses `threading.Lock` and appears correctly implemented. All 8 sequential unit tests pass. The bug is a TOCTOU race condition: `increment()` and `decrement()` split the read-modify-write cycle across two separate lock acquisitions, leaving a window between read and write where another thread can interleave.
+
+ ```python
+ def increment(self):
+     with self._lock:
+         current = self.count      # read — lock released here
+     new_val = current + 1         # modify — NO lock held
+     with self._lock:
+         self.count = new_val      # write — race window
+ ```
+
+ The agent must:
+ - recognize that 8/8 passing tests do not prove correctness for concurrent code,
+ - reason about thread interleaving,
+ - design a concurrent stress test that surfaces the race,
+ - fix the atomicity issue by collapsing read-modify-write into a single lock scope, and
+ - verify the fix survives a 1000-thread stress test.
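
A minimal sketch of the single-lock-scope fix described above (illustrative, not the reference solution):

```python
def increment(self):
    with self._lock:          # hold the lock across the whole read-modify-write
        self.count += 1       # no window left for another thread to interleave

def decrement(self):
    with self._lock:
        self.count -= 1
```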
+
+ **Hard task grader breakdown:**
+ - Sequential tests pass (agent submissions only): **0.40**
+ - 1000-thread concurrent stress test passes (run 3×, must pass all 3): **0.30**
+ - Hypothesis accuracy (mentions "race condition", "atomic", "lock"): **0.20**
+ - Efficiency bonus (fixed within 5 attempts): **0.10**
+
+ ---
+
+ ## Reward Function Design
+
+ The reward function provides dense signal at every step so an RL agent can learn from every action — not just the final outcome.
+
+ ### Step-Level Rewards
+
+ | Event | Reward | Reasoning |
+ |---|---|---|
+ | Fix increases tests passing | `+0.15 × (Δpassed / total)` | Scaled progress |
+ | Fix decreases tests passing | `-0.10 × (Δfailed / total)` | Regression penalty |
+ | Fix makes no change to passing count | `-0.05` | Stagnation penalty |
+ | All tests pass | `+0.50` | Major bonus on top of progress |
+ | Submitted code times out in sandbox | `-0.10` | Penalizes infinite loops |
+ | `submit_fix` without hypothesis field | `-0.10` | Hypothesis is required |
+ | First `query_context` use | `0.00` | Free |
+ | Subsequent `query_context` uses | `-0.05` each | Diminishing returns |
+ | Episode truncated at max_steps | `-0.20` | Penalizes indecision |
+
+ ### Episode-Level Grader Score
+
+ ```
+ grader_score = test_pass_ratio      × 0.60
+              + efficiency_bonus     × 0.20
+              + hypothesis_accuracy  × 0.15
+              + early_solve_bonus    × 0.05
+
+ test_pass_ratio     = agent_best_tests_passed / tests_total
+                       (from agent submissions only — never the initial buggy code run)
+ efficiency_bonus    = max(0, (max_attempts - attempts_used) / max_attempts)
+ hypothesis_accuracy = fraction of hypotheses correctly identifying the bug
+ early_solve_bonus   = 1.0 if solved within ceil(max_attempts / 3) attempts, else 0.0
+ ```
+
+ **Score floor design:** `test_pass_ratio` uses only the agent's submitted attempts — never the initial buggy code run. The medium buggy code passes 6/10 tests and the hard buggy code passes 8/8 tests sequentially. Without this design, a dummy agent that submits nothing would score 0.36 and 0.40 for free, respectively. The grader recalculates from the `attempts` list to guarantee the score floor is 0.0.
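
Restated as code, and assuming a simple list of per-attempt records rather than the graders' real data structures, the episode score could be computed roughly like this:

```python
import math

def grader_score(attempts, tests_total, max_attempts):
    """attempts: per-submission records, e.g. {"tests_passed": 7, "hypothesis_correct": True}."""
    if not attempts:                                    # dummy agent: the floor is 0.0
        return 0.0
    best_passed = max(a["tests_passed"] for a in attempts)   # agent submissions only
    test_pass_ratio = best_passed / tests_total
    attempts_used = len(attempts)
    efficiency_bonus = max(0, (max_attempts - attempts_used) / max_attempts)
    hypothesis_accuracy = sum(a["hypothesis_correct"] for a in attempts) / attempts_used
    solved_early = best_passed == tests_total and attempts_used <= math.ceil(max_attempts / 3)
    early_solve_bonus = 1.0 if solved_early else 0.0
    return (test_pass_ratio * 0.60
            + efficiency_bonus * 0.20
            + hypothesis_accuracy * 0.15
            + early_solve_bonus * 0.05)
```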

  ---

+ ## Security Sandbox
+
+ Every `submit_fix` action executes agent-generated Python code. All execution routes through `env/sandbox.py` — never via raw `exec()` anywhere in the codebase.
+
+ **Layer 1 — AST Import Filtering:** Before execution, an AST walk detects blocked imports (`os`, `sys`, `subprocess`, `socket`, `importlib`, `shutil`, `pathlib`, `pickle`, `ctypes`, `multiprocessing`, and others). Uses `ast.parse()` + `ast.walk()` — not string matching, which can be bypassed.
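
A minimal sketch of that AST walk (illustrative; the real filter in `env/sandbox.py` may block more node types and patterns than shown here):

```python
import ast

BLOCKED_MODULES = {"os", "sys", "subprocess", "socket", "importlib",
                   "shutil", "pathlib", "pickle", "ctypes", "multiprocessing"}

def find_blocked_imports(source: str) -> list[str]:
    """Return blocked top-level modules imported anywhere in the submitted code."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits += [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            hits.append(node.module.split(".")[0])
    return [module for module in hits if module in BLOCKED_MODULES]
```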
+
+ **Layer 2 — Subprocess Isolation:** Code runs in a child subprocess with a stripped environment. Even if the AST filter is bypassed, the subprocess cannot affect the server process.
+
+ **Layer 3 — Hard Timeout:** Every execution is killed after 10 seconds. Infinite loops in submitted code return `timed_out: True` and a `-0.10` step reward.
+
+ **Layer 4 — Memory Limit:** 256 MB per execution.
+
+ **Threading exception:** The hard task requires `threading` to create and verify the race condition. The sandbox accepts `allow_threading=True` for that task only. All other tasks block threading entirely.

  ---

+ ## Data Models
+
+ ```python
+ class Observation(BaseModel):
+     task_id: str                        # "easy" | "medium" | "hard"
+     task_description: str
+     buggy_code: str                     # Original broken code — always visible
+     test_suite: str                     # Full test file
+     initial_error_output: str           # Sandbox output on buggy code at reset()
+     current_code: str                   # Most recent submitted code
+     current_error_output: str           # Test output on current_code
+     tests_passed: int
+     tests_total: int
+     previous_attempts: List[FixAttempt] # Full episode history
+     attempts_remaining: int
+     max_attempts: int
+     step_number: int
+     max_steps: int
+     done: bool
+     score_estimate: float               # Running grader estimate shown to agent
+     hint_used: bool
+
+ class Action(BaseModel):
+     action_type: str                    # "submit_fix" | "query_context" | "give_up"
+     fixed_code: Optional[str]           # Complete corrected code (not a diff)
+     hypothesis: Optional[str]           # REQUIRED with submit_fix — missing costs -0.10
+     query_type: Optional[str]           # "function_signature" | "related_code"
+                                         # | "error_explanation" | "test_details"
+     query_target: Optional[str]
+     final_diagnosis: Optional[str]      # Used with give_up
+
+ class Reward(BaseModel):
+     step_reward: float                  # This step: range -1.0 to +1.0
+     cumulative_reward: float            # Episode total so far
+     grader_score: float                 # 0.0 during episode; official score on terminal step
+     breakdown: Dict[str, float]         # Itemized components for interpretability
+ ```
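
For reference, a `submit_fix` payload built from the `Action` model above might look like this; the field values are illustrative:

```python
FIXED_SOURCE = "..."  # the complete corrected module, sent as one string (not a diff)

submit = {
    "action_type": "submit_fix",
    "fixed_code": FIXED_SOURCE,
    "hypothesis": "the while loop exits one iteration early: left < right should be left <= right",
}

query = {
    "action_type": "query_context",
    "query_type": "error_explanation",
    "query_target": "test_find_last_element",   # illustrative target name
}
```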

+ ---
+
+ ## OpenEnv API Compliance
+
+ ```yaml
+ name: agentdebugger-env
+ version: 1.0.0
+ domain: software_engineering
+ observation_type: structured
+ action_type: structured
+ reward_type: dense
+ episode_termination: action_or_step_limit
+ tasks:
+   - {id: easy, difficulty: easy, max_steps: 8, max_attempts: 5}
+   - {id: medium, difficulty: medium, max_steps: 15, max_attempts: 7}
+   - {id: hard, difficulty: hard, max_steps: 25, max_attempts: 10}
+ ```
+
+ All endpoints always return HTTP 200 — errors go in the response body under `info["error"]`, never as HTTP 4xx/5xx. This ensures automated evaluation never sees a failed HTTP response.
+
+ | Endpoint | Method | Description |
+ |---|---|---|
+ | `/` | GET | API overview — lists all endpoints and tasks |
+ | `/health` | GET | Health check — always HTTP 200 |
+ | `/tasks` | GET | All tasks with metadata |
+ | `/reset` | POST | Start episode. Body: `{"task_id": "easy"}` |
+ | `/step` | POST | Submit one action |
+ | `/state` | GET | Full internal episode state |
 

  ---

+ ## Installation & Usage
+
+ ### Local Setup

  ```bash
+ git clone https://github.com/shasshaank/AgentDebuggerEnv
+ cd AgentDebuggerEnv
+ pip install -r requirements.txt
+
+ # Start the environment server
+ uvicorn env.server:app --reload --port 8000
+
+ # Verification: Run the pre-submission validator
+ python validator.py
+
+ # Verify it's running
+ curl http://localhost:8000/health
  ```

+ ### Docker
+
  ```bash
+ docker build -t agentdebugger-env .
+ docker run -p 8000:8000 agentdebugger-env
  ```

+ ### Running the Baseline Inference Script
+
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o"
+ export HF_TOKEN="your_api_key"
+ export ENV_BASE_URL="http://localhost:8000"
+ python inference.py
+ ```
+
+ Using Meta-Llama via HuggingFace (Recommended):
+
+ ```bash
+ export API_BASE_URL="https://router.huggingface.co/v1"
+ export MODEL_NAME="meta-llama/Llama-3.1-70B-Instruct"
+ export HF_TOKEN="your_huggingface_token"
  export ENV_BASE_URL="http://localhost:8000"
  python inference.py
  ```

  ---

+ ## Environment Variables

+ | Variable | Description | Default |
+ |---|---|---|
  | `API_BASE_URL` | LLM API endpoint | `https://api.openai.com/v1` |
+ | `MODEL_NAME` | Model identifier | `gpt-4o` |
+ | `HF_TOKEN` | API key / HuggingFace token | — |
+ | `ENV_BASE_URL` | Environment server address | `http://localhost:8000` |
+
+ ---
+
+ ## Project Structure
+
+ ```
+ AgentDebuggerEnv/
+ ├── inference.py              # Baseline script (root — hackathon requirement)
+ ├── env/
+ │   ├── environment.py        # Core OpenEnv: reset(), step(), state()
+ │   ├── models.py             # Pydantic v2 Observation, Action, Reward
+ │   ├── sandbox.py            # AST-based sandboxed code execution
+ │   ├── server.py             # FastAPI: /reset /step /state /health /tasks
+ │   ├── tasks/
+ │   │   ├── task_easy.py      # Off-by-one in binary search
+ │   │   ├── task_medium.py    # Red herring authentication bug
+ │   │   └── task_hard.py      # Concurrency race condition
+ │   └── graders/
+ │       ├── grader_easy.py    # Test pass + efficiency scoring
+ │       ├── grader_medium.py  # Red herring detection + score floor fix
+ │       └── grader_hard.py    # Sequential + concurrent stress test
+ ├── openenv.yaml
+ ├── Dockerfile
+ ├── requirements.txt
+ └── uv.lock                   # Reproducible dependency resolution
+ ```

  ---

+ ## Design Decisions
+
+ **Why is hypothesis mandatory?** Requiring a hypothesis on every `submit_fix` prevents degenerate strategies of submitting random code until something passes. It also enables the grader to score `hypothesis_accuracy` independently from `test_pass_ratio` — measuring reasoning quality separately from outcome quality.
+
+ **Why recalculate `test_pass_ratio` from the attempts list?** The medium buggy code passes 6/10 tests and the hard buggy code passes 8/8 tests sequentially. If the grader used the environment's `best_tests_passed` (which includes the initial buggy code run at reset), a dummy agent that submits nothing would score 0.36 and 0.40 for free. Recalculating from the `attempts` list guarantees the score floor is 0.0.
+
+ **Why run the concurrent stress test 3 times?** Race conditions are non-deterministic. A partial fix that narrows the race window may pass once by luck. Requiring all 3 runs to pass filters out lucky partial fixes. Passing 1 of 3 gives 0.15 — partial credit for progress, not full credit.
+
+ **Why not use pytest directly?** Using pytest as the test runner makes output parsing dependent on pytest's version and output format. The environment uses a lightweight custom test runner embedded as a Python string, producing a consistent `"N passed, M failed"` format that `_parse_tests_passed()` can reliably parse across all platforms and environments.
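
A sketch of how trivial such a parser can stay thanks to the fixed output format (an assumption about `_parse_tests_passed`'s internals, shown only to illustrate the design point):

```python
import re

def parse_tests_passed(output: str) -> tuple[int, int]:
    """Extract (passed, failed) from the runner's fixed 'N passed, M failed' summary line."""
    match = re.search(r"(\d+) passed, (\d+) failed", output)
    if not match:
        return 0, 0   # e.g. the sandbox timed out before printing a summary
    return int(match.group(1)), int(match.group(2))

# parse_tests_passed("6 passed, 4 failed") -> (6, 4)
```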

+ **Why does `query_context` cost reward after the first use?** Free unlimited context queries would allow agents to trivially read all available information before attempting any fix. The cost structure forces agents to make strategic decisions about when additional information is worth spending a step on — which is a core part of real debugging under time pressure.

  ---

+ ## License & Attribution
+
+ **License:** MIT — see [LICENSE](LICENSE)
+
+ **Author:** Shashaank | GitHub: [@shasshaank](https://github.com/shasshaank) | HF: [@shashaank0707](https://huggingface.co/shashaank0707)
+
+ **Live Environment:** https://huggingface.co/spaces/shashaank0707/AgentDebugger-env
+
+ **Submitted to:** Meta + PyTorch + HuggingFace OpenEnv Hackathon

  ---

+ ## Submission Integrity
+
+ - **Commit SHA:** `e93446da6e57b3f582db65a947dc0abef18e66c6`
+ - **Last Verified Sync:** 2026-04-09
+ - **Platform Match:** GitHub and HF Space are in sync at this HEAD