shank committed on
Commit · 6cca39d
1 Parent(s): 5c507c3
Update: Refined and validated values

- README.md +24 -41
- openenv.yaml +1 -1

README.md
CHANGED
@@ -62,19 +62,12 @@ The hard task is specifically designed so that frontier models fail most of the

---

**2. Stagnation Under Uncertainty** – Does the agent repeat the same failed fix instead of updating its hypothesis? The `-0.05` stagnation penalty and `hypothesis_accuracy` score together capture this. Submitting the same code twice costs reward twice.

**3. Exploration vs. Exploitation** – The `query_context` action costs a step but provides information. The first query is free; subsequent queries cost `-0.05`. Agents that query productively before attempting a fix demonstrate better exploration behavior than those that blindly submit fixes.

**4. Test-Suite as Sufficient Proof** – The hard task tests whether an agent knows when passing tests are not enough. All 8 sequential tests pass on the buggy code. An agent that sees this and concludes the code is correct, without reasoning about concurrency, scores at most 0.40 and fails the most important grader component (the concurrent stress test worth 0.30).

All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response, making this environment useful as a diagnostic tool, not just a benchmark.

---
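The first-query-free, `-0.05`-thereafter accounting described above can be sketched as follows. This is a minimal illustration only; the function name and episode-state shape are assumptions, not the environment's actual code.

```python
def query_step_reward(queries_so_far: int) -> float:
    """Reward contribution of the next query_context action.

    Per the rule above: the first query is free, each subsequent
    query costs -0.05. (Illustrative sketch, not the env's code.)
    """
    return 0.0 if queries_so_far == 0 else -0.05

# Total query cost over an episode with three queries:
episode_cost = sum(query_step_reward(i) for i in range(3))  # -> -0.1
```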
@@ -187,37 +180,27 @@ Every `submit_fix` action executes agent-generated Python code. All execution ro

```python
class Observation(BaseModel):
    task_id: str                         # "easy" | "medium" | "hard"
    current_error_output: str            # Test output on current_code
    tests_passed: int
    tests_total: int
    previous_attempts: List[FixAttempt]  # Full episode history
    attempts_remaining: int
    max_attempts: int
    step_number: int
    max_steps: int
    done: bool
    score_estimate: float
    hint_used: bool

class Action(BaseModel):
    action_type: str
    fixed_code: Optional[str]
    hypothesis: Optional[str]
    query_type: Optional[str]
    # | "error_explanation" | "test_details"
    query_target: Optional[str]
    final_diagnosis: Optional[str]       # Used with give_up

class Reward(BaseModel):
    step_reward: float
    cumulative_reward: float
    grader_score: float
    breakdown: Dict[str, float]
```

---
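`FixAttempt` is referenced by `previous_attempts` but not defined in this excerpt. A plausible dependency-free stand-in is sketched below; the field names are inferred from the surrounding models and are assumptions, not the environment's actual definition.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FixAttempt:
    # Hypothetical stand-in for the FixAttempt referenced in Observation;
    # the environment's actual definition may differ.
    code: str          # The submitted fix
    hypothesis: str    # The agent's stated theory for this attempt
    tests_passed: int  # Suite results for this attempt
    tests_total: int

# Example episode history as it might appear in previous_attempts:
previous_attempts: List[FixAttempt] = [
    FixAttempt("def add(a, b):\n    return a - b", "wrong operator", 5, 8),
]
```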
@@ -238,7 +221,7 @@ tasks:

  - {id: hard, difficulty: hard, max_steps: 25, max_attempts: 10}
```

| Endpoint | Method | Description |
|---|---|---|
@@ -303,9 +286,9 @@ python inference.py

| Variable | Description | Default |
|---|---|---|
| `API_BASE_URL` | LLM API endpoint | `https://
| `MODEL_NAME` | Model identifier | `
| `HF_TOKEN` |
| `ENV_BASE_URL` | Environment server address | `http://localhost:8000` |

---
@@ -364,6 +347,6 @@ AgentDebuggerEnv/

## Submission Integrity

- **Commit SHA:** `
- **Last Verified Sync:** 2026-04-09
- **Platform Match:** GitHub and HF Space are in sync at this HEAD
---

All four failure modes produce distinct, interpretable score components in the `breakdown` field of every `Reward` response:

* **Red Herring Susceptibility**: Does the agent overtrust error messages (Medium Task symptom) or trace data flow to the root?
* **Stagnation**: Does the agent repeat failed fixes? Penalized by the `-0.05` stagnation penalty.
* **Exploration/Exploitation**: Measures whether agents query for context productively before attempting fixes.
* **Test-Suite Overconfidence**: Detects whether an agent fails to reason about concurrency when sequential tests pass (Hard Task).

---
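As an illustration of how such a `breakdown` might be consumed diagnostically, here is a hedged sketch. The component keys below are hypothetical examples, not the grader's actual schema.

```python
def dominant_failure(breakdown):
    """Return the name of the lowest-scoring breakdown component,
    i.e. the likeliest failure mode for the episode."""
    return min(breakdown, key=breakdown.get)

# Illustrative breakdown dict; real component names may differ.
example_breakdown = {
    "red_herring": 0.80,
    "stagnation": 0.20,       # e.g. the same failed fix was resubmitted
    "exploration": 0.60,
    "concurrency_proof": 0.50,
}
print(dominant_failure(example_breakdown))  # -> stagnation
```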
```python
class Observation(BaseModel):
    task_id: str               # "easy" | "medium" | "hard"
    buggy_code: str            # Original broken code
    test_suite: str            # Full test file content
    current_code: str          # Most recent submitted code
    current_error_output: str  # Sandbox stdout/stderr output
    tests_passed: int
    attempts_remaining: int
    max_attempts: int
    done: bool
    score_estimate: float      # Running grader estimate

class Action(BaseModel):
    action_type: str           # "submit_fix" | "query_context" | "give_up"
    fixed_code: Optional[str]  # Complete corrected code
    hypothesis: Optional[str]  # Theory about the bug (required for submit)
    query_type: Optional[str]  # "function_signature" | "error_explanation" etc.

class Reward(BaseModel):
    step_reward: float         # Dense signal: range -1.0 to +1.0
    cumulative_reward: float
    grader_score: float        # Official score (terminal step only)
    breakdown: Dict[str, float]  # Itemized components
```

---
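The models above are pydantic; the stand-in below uses a plain dataclass so the sketch is dependency-free. It enforces the rule stated in the field comment (a hypothesis is required for `submit_fix`); the validation mechanics are an assumption for illustration, not the environment's actual code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    # Dependency-free stand-in for the pydantic Action model above.
    action_type: str                 # "submit_fix" | "query_context" | "give_up"
    fixed_code: Optional[str] = None
    hypothesis: Optional[str] = None
    query_type: Optional[str] = None

    def __post_init__(self):
        # Documented rule: submitting a fix requires a stated hypothesis.
        if self.action_type == "submit_fix" and not self.hypothesis:
            raise ValueError("submit_fix requires a hypothesis")

# A query action needs no hypothesis; a fix submission does.
probe = Action(action_type="query_context", query_type="error_explanation")
fix = Action(action_type="submit_fix",
             fixed_code="def add(a, b):\n    return a + b",
             hypothesis="operator was '-' instead of '+'")
```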
  - {id: hard, difficulty: hard, max_steps: 25, max_attempts: 10}
```

Application-level errors are returned in `info.error` inside the response body. Core evaluation endpoints are designed to avoid 4xx/5xx status codes for agent-level mistakes, so the evaluation flow is never interrupted by HTTP error handling.

| Endpoint | Method | Description |
|---|---|---|
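One way a client could honor this contract is sketched below. The exact response shape (a top-level `info` dict alongside the observation) is an assumption based on the sentence above, not a documented schema.

```python
from typing import Optional

def app_error(body: dict) -> Optional[str]:
    """Return the application-level error from info.error, or None on success.

    Sketch of the in-body error convention above: HTTP status is 200 even
    for agent-level mistakes, so errors are read from the JSON body.
    """
    return (body.get("info") or {}).get("error")

# Hypothetical success and failure bodies:
ok_body = {"observation": {"task_id": "hard"}, "info": {}}
err_body = {"observation": None,
            "info": {"error": "hypothesis required for submit_fix"}}
```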
| Variable | Description | Default |
|---|---|---|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `meta-llama/Llama-3.1-70B-Instruct` |
| `HF_TOKEN` | Hugging Face Token (Read) | – |
| `ENV_BASE_URL` | Environment server address | `http://localhost:8000` |

---
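A typical way `inference.py` might resolve these variables is sketched here, assuming the common `os.environ` fallback pattern with the table's defaults; this is an assumed pattern, not the repo's actual code.

```python
import os

# Resolve configuration from the environment variables in the table,
# falling back to the documented defaults. HF_TOKEN has no default,
# so a missing token is surfaced explicitly rather than silently.
API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.environ.get("MODEL_NAME", "meta-llama/Llama-3.1-70B-Instruct")
ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:8000")
HF_TOKEN = os.environ.get("HF_TOKEN")  # no default; required for API calls

if HF_TOKEN is None:
    print("warning: HF_TOKEN is not set; authenticated requests will fail")
```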
## Submission Integrity

- **Commit SHA:** `5c507c313ff2c209d7b770af6f08cf6ed6ab1568`
- **Last Verified Sync:** 2026-04-09
- **Platform Match:** GitHub and HF Space are in sync at this HEAD
openenv.yaml
CHANGED

@@ -54,7 +54,7 @@ baseline:

  medium: 0.50
  hard: 0.18
author: "Shashaank (GitHub: @shasshaank, HF: @shashaank0707)"
- # Submission Integrity: SHA
+ # Submission Integrity: SHA 5c507c313ff2c209d7b770af6f08cf6ed6ab1568 | Verified 2026-04-09
license: MIT
huggingface_space: shashaank0707/AgentDebugger-env
api_base_url_env_var: API_BASE_URL