shank committed
Commit 6cca39d · 1 Parent(s): 5c507c3

Update: Refined and validated values

Files changed (2)
  1. README.md +24 -41
  2. openenv.yaml +1 -1
README.md CHANGED
@@ -62,19 +62,12 @@ The hard task is specifically designed so that frontier models fail most of the
  
  ---
  
- ## The Four Agent Failure Modes This Environment Measures
- 
- These are real, documented failure modes in LLM agents. AgentDebuggerEnv makes all four measurable and independently scorable for the first time:
- 
- **1. Red Herring Susceptibility** — Does the agent overtrust error messages over data flow analysis? The medium task's error points directly to `authenticate_user`, which is completely correct. The bug is in `hash_password`. An agent that follows the red herring scores 0.0 on hypothesis accuracy even if it eventually stumbles onto the right fix.
- 
- **2. Stagnation Under Uncertainty** — Does the agent repeat the same failed fix instead of updating its hypothesis? The `-0.05` stagnation penalty and `hypothesis_accuracy` score together capture this. Submitting the same code twice costs reward twice.
- 
- **3. Exploration vs. Exploitation** — The `query_context` action costs a step but provides information. The first query is free; subsequent queries cost `-0.05`. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that blindly submit fixes.
- 
- **4. Test-Suite as Sufficient Proof** — The hard task tests whether an agent knows when passing tests are not enough. All 8 sequential tests pass on the buggy code. An agent that sees this and concludes the code is correct — without reasoning about concurrency — scores at most 0.40 and fails the most important grader component (the concurrent stress test worth 0.30).
- 
- All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response, making this environment useful as a diagnostic tool, not just a benchmark.
+ All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response:
+ 
+ * **Red Herring Susceptibility**: Does the agent overtrust error messages (Medium Task symptom) or trace data flow to the root?
+ * **Stagnation**: Does the agent repeat failed fixes? Prohibited by the `-0.05` stagnation penalty.
+ * **Exploration/Exploitation**: Measures if agents query for context productively before attempting fixes.
+ * **Test-Suite Overconfidence**: Detects if an agent fails to reason about concurrency when sequential tests pass (Hard Task).
  
  ---
  
@@ -187,37 +180,27 @@ Every `submit_fix` action executes agent-generated Python code. All execution ro
  ```python
  class Observation(BaseModel):
      task_id: str                         # "easy" | "medium" | "hard"
-     task_description: str
-     buggy_code: str                      # Original broken code — always visible
-     test_suite: str                      # Full test file
-     initial_error_output: str            # Sandbox output on buggy code at reset()
-     current_code: str                    # Most recent submitted code
-     current_error_output: str            # Test output on current_code
-     tests_passed: int
-     tests_total: int
-     previous_attempts: List[FixAttempt]  # Full episode history
+     buggy_code: str                      # Original broken code
+     test_suite: str                      # Full test file content
+     current_code: str                    # Most recent submitted code
+     current_error_output: str            # Sandbox stdout/stderr output
+     tests_passed: int
      attempts_remaining: int
      max_attempts: int
-     step_number: int
-     max_steps: int
      done: bool
-     score_estimate: float                # Running grader estimate shown to agent
-     hint_used: bool
+     score_estimate: float                # Running grader estimate
  
  class Action(BaseModel):
-     action_type: str                     # "submit_fix" | "query_context" | "give_up"
-     fixed_code: Optional[str]            # Complete corrected code (not a diff)
-     hypothesis: Optional[str]            # REQUIRED with submit_fix — missing costs -0.10
-     query_type: Optional[str]            # "function_signature" | "related_code"
-                                          # | "error_explanation" | "test_details"
-     query_target: Optional[str]
-     final_diagnosis: Optional[str]       # Used with give_up
+     action_type: str                     # "submit_fix" | "query_context" | "give_up"
+     fixed_code: Optional[str]            # Complete corrected code
+     hypothesis: Optional[str]            # Theory about the bug (required for submit)
+     query_type: Optional[str]            # "function_signature" | "error_explanation" etc.
  
  class Reward(BaseModel):
-     step_reward: float                   # This step: range -1.0 to +1.0
-     cumulative_reward: float             # Episode total so far
-     grader_score: float                  # 0.0 during episode; official score on terminal step
-     breakdown: Dict[str, float]          # Itemized components for interpretability
+     step_reward: float                   # Dense signal: range -1.0 to +1.0
+     cumulative_reward: float
+     grader_score: float                  # Official score (terminal step only)
+     breakdown: Dict[str, float]          # Itemized components
  ```
  
  ---
@@ -238,7 +221,7 @@ tasks:
    - {id: hard, difficulty: hard, max_steps: 25, max_attempts: 10}
  ```
  
- All endpoints return HTTP 200 always — errors go in the response body under `info["error"]`, never as HTTP 4xx/5xx. This ensures automated evaluation never sees a failed HTTP response.
+ Application-level errors are returned in `info.error` inside the response body. Core evaluation endpoints are designed to avoid 4xx/5xx status codes for agent-level mistakes, ensuring the evaluation flow is never interrupted by network-level exceptions.
  
  | Endpoint | Method | Description |
  |---|---|---|
@@ -303,9 +286,9 @@ python inference.py
  
  | Variable | Description | Default |
  |---|---|---|
- | `API_BASE_URL` | LLM API endpoint | `https://api.openai.com/v1` |
- | `MODEL_NAME` | Model identifier | `gpt-4o` |
- | `HF_TOKEN` | API key / HuggingFace token | — |
+ | `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
+ | `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.1-70B-Instruct` |
+ | `HF_TOKEN` | Hugging Face Token (Read) | — |
  | `ENV_BASE_URL` | Environment server address | `http://localhost:8000` |
  
  ---
@@ -364,6 +347,6 @@ AgentDebuggerEnv/
  
  ## Submission Integrity
  
- - **Commit SHA:** `e93446da6e57b3f582db65a947dc0abef18e66c6`
+ - **Commit SHA:** `5c507c313ff2c209d7b770af6f08cf6ed6ab1568`
  - **Last Verified Sync:** 2026-04-09
  - **Platform Match:** GitHub and HF Space are in sync at this HEAD
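The `Action` schema in the README diff can be exercised client-side. The sketch below is illustrative only: it uses stdlib dataclasses as a stand-in for the README's pydantic models (so it runs without pydantic), the field names come from the diff, and the `validate` helper is a hypothetical client-side check, not part of the environment.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stdlib stand-in for the README's pydantic Action model.
@dataclass
class Action:
    action_type: str                  # "submit_fix" | "query_context" | "give_up"
    fixed_code: Optional[str] = None  # Complete corrected code
    hypothesis: Optional[str] = None  # Theory about the bug (required for submit)
    query_type: Optional[str] = None

def validate(action: Action) -> List[str]:
    """Hypothetical client-side check mirroring the documented
    submit_fix requirements (a missing hypothesis is penalized)."""
    problems = []
    if action.action_type == "submit_fix":
        if not action.hypothesis:
            problems.append("submit_fix without hypothesis")
        if not action.fixed_code:
            problems.append("submit_fix without fixed_code")
    return problems

fix = Action(action_type="submit_fix", fixed_code="def hash_password(p): ...")
print(validate(fix))  # ['submit_fix without hypothesis']
```

Catching a missing hypothesis before calling the environment avoids paying the documented reward penalty for it.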
openenv.yaml CHANGED
@@ -54,7 +54,7 @@ baseline:
    medium: 0.50
    hard: 0.18
  author: "Shashaank (GitHub: @shasshaank, HF: @shashaank0707)"
- # Submission Integrity: SHA e93446da6e57b3f582db65a947dc0abef18e66c6 | Verified 2026-04-09
+ # Submission Integrity: SHA 5c507c313ff2c209d7b770af6f08cf6ed6ab1568 | Verified 2026-04-09
  license: MIT
  huggingface_space: shashaank0707/AgentDebugger-env
  api_base_url_env_var: API_BASE_URL
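The README change above states that application-level errors arrive in the response body under `info.error` rather than as 4xx/5xx statuses, so a client must inspect the body instead of the HTTP status. A minimal sketch of that check, assuming only that responses are JSON objects with an `info` dict as in the diff (the surrounding response shape is invented for illustration):

```python
from typing import Optional

def extract_error(body: dict) -> Optional[str]:
    """Return the application-level error from a response body, if any.
    Per the README convention, errors live under info["error"]."""
    return (body.get("info") or {}).get("error")

# Hypothetical example bodies.
ok = {"observation": {"task_id": "easy"}, "info": {}}
bad = {"observation": None, "info": {"error": "unknown action_type"}}
print(extract_error(ok))   # None
print(extract_error(bad))  # unknown action_type
```

A harness should branch on this value after every call, since a 200 status alone no longer signals success.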