shank committed on
Commit · e766743
Parent(s): 865526d
deleted implementation plan
implementation_plan.md +0 -187

implementation_plan.md DELETED
@@ -1,187 +0,0 @@

# AgentDebuggerEnv: Implementation Plan

An OpenEnv-compliant debugging environment where AI agents fix broken code through iterative hypothesis-test-fix cycles. Submission for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**.

## User Review Required

> [!IMPORTANT]
> This is a large project with **15+ files** to create. The entire codebase needs to be built from scratch (only the README exists currently). Please confirm you'd like me to proceed with the full implementation.

> [!WARNING]
> The README specifies `huggingface_space: shashaank/agentdebugger-env`. You'll need to create this HuggingFace Space and deploy the Docker container there for the hackathon submission. I'll build everything locally; deployment is a manual step.

## Proposed Changes

The implementation follows the exact order from the README's Section 14 checklist. Each step depends on the previous.

---

### Step 1: Sandbox (`env/sandbox.py`), Build & Test First

This is the most security-critical component. Every code execution goes through here.

#### [NEW] [sandbox.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/sandbox.py)

- `execute_code(code, test_code, allow_threading=False) → (str, bool, int)`
- AST-based import detection (not string matching) to block dangerous imports
- `BLOCKED_IMPORTS` list: os, sys, subprocess, socket, importlib, shutil, pathlib, glob, pickle, shelve, dbm, sqlite3, ftplib, http, urllib, requests, httpx, asyncio, multiprocessing, threading (unless `allow_threading=True`), ctypes, cffi, resource, signal, mmap, gc
- Write code + test_code to a temp file, run in subprocess with `timeout=10`
- Capture merged stdout+stderr
- Clean up temp files in `finally` block
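
The bullets above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the blocklist is abbreviated, the extra `timeout` parameter is added for convenience, and the meaning of the returned tuple (output, timed-out flag, elapsed milliseconds) is an assumption, since the plan only gives its types.

```python
import ast
import os
import subprocess
import sys
import tempfile
import time

# Abbreviated for the sketch; the plan's BLOCKED_IMPORTS list is longer.
BLOCKED_IMPORTS = {"os", "sys", "subprocess", "socket", "threading"}


def find_blocked_imports(source: str, allow_threading: bool = False) -> set:
    """Return blocked top-level module names imported anywhere in `source`."""
    blocked = BLOCKED_IMPORTS - ({"threading"} if allow_threading else set())
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if root in blocked:
                found.add(root)
    return found


def execute_code(code: str, test_code: str, allow_threading: bool = False,
                 timeout: int = 10) -> tuple:
    """Assumed return shape: (output, timed_out, elapsed_ms)."""
    source = code + "\n\n" + test_code
    try:
        blocked = find_blocked_imports(source, allow_threading)
    except SyntaxError as exc:
        return f"SyntaxError: {exc}", False, 0
    if blocked:
        return f"Blocked imports: {sorted(blocked)}", False, 0

    fd, path = tempfile.mkstemp(suffix=".py")
    start = time.monotonic()
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(source)
        proc = subprocess.run([sys.executable, path], stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True, timeout=timeout)
        return proc.stdout, False, int((time.monotonic() - start) * 1000)
    except subprocess.TimeoutExpired:
        return "Timed out", True, int(timeout * 1000)
    finally:
        os.remove(path)  # clean up the temp file on every path
```

Note that an AST walk only sees `import` statements; dynamic imports such as `__import__("os")` would need a separate check.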

#### [NEW] [test_sandbox.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_sandbox.py)

- 5 required tests: timeout, os blocked, sys blocked, clean code runs, syntax error returns output

---

### Step 2: Data Models

#### [NEW] [models.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/models.py)

- `FixAttempt`, `Observation`, `Action`, `Reward`: all Pydantic v2 `BaseModel` subclasses
- Exact field names and types from README Section 3

---

### Step 3: Task Definitions

#### [NEW] [task_easy.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_easy.py)

- Binary search whose loop condition uses `<` instead of `<=`
- 8-test suite, 7 pass initially, 1 fails (last element)
- Ground truth: `hypothesis_keywords`: ["left <= right", "termination", "last element", "off by one", "<="]
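
One plausible form of this bug, shown as a sketch: the function below is an assumption about what the task file contains, with an `inclusive` flag (hypothetical, for illustration only) switching between the buggy `<` and the fixed `<=` loop condition.

```python
def binary_search(arr, target, inclusive=True):
    """Return the index of target in sorted arr, or -1 if absent.

    inclusive=False reproduces the bug: the loop stops when left == right,
    so the final candidate position is never checked.
    """
    left, right = 0, len(arr) - 1
    while (left <= right) if inclusive else (left < right):
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


data = [1, 3, 5, 7, 9]
print(binary_search(data, 9, inclusive=False))  # buggy: misses the last element
print(binary_search(data, 9, inclusive=True))   # fixed: finds index 4
```

In this sketch the strict condition can also miss targets at other positions; the real task file pins down which of its 8 tests fails.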

#### [NEW] [task_medium.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_medium.py)

- Three functions: `hash_password`, `validate_password`, `authenticate_user`; the bug is in `hash_password`
- 10-test suite, 6 pass, 4 fail (edge cases with hash mismatch)
- Red herring: the error points to `authenticate_user`, but the bug is in `hash_password`
- Hypothesis must mention "hash_password" AND at least 1 other keyword
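
The plan's open questions describe the bug as a bytes/str conversion in which `str()` adds a `b'` prefix. A minimal sketch of that mechanism follows; SHA-256 and the function bodies are assumptions, since the plan does not state the hashing scheme.

```python
import hashlib


def hash_password_buggy(password: str) -> str:
    encoded = password.encode("utf-8")
    # Bug: str() on bytes yields the repr "b'...'", not the original text,
    # so the digest is computed over the wrong input.
    return hashlib.sha256(str(encoded).encode("utf-8")).hexdigest()


def hash_password_fixed(password: str) -> str:
    # Hash the encoded bytes directly.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


print(str("secret".encode("utf-8")))  # shows the b'...' wrapper
print(hash_password_buggy("secret") == hash_password_fixed("secret"))
```

A stored hash produced by the correct routine never matches the buggy digest, which is why the failure surfaces later, at comparison time, rather than inside `hash_password` itself.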

#### [NEW] [task_hard.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_hard.py)

- `ConnectionCounter` with race condition in `increment()`/`decrement()`
- 8 sequential tests all pass on buggy code
- Bug only surfaces under concurrent access
- `allow_threading=True` for this task
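
A sketch of the kind of race this task describes, assuming the counter keeps its value in a `count` attribute (the attribute name and the lock-based fix are illustrative, not taken from the README):

```python
import threading


class ConnectionCounter:
    """Buggy sketch: each update is an unsynchronized read-modify-write."""

    def __init__(self):
        self.count = 0

    def increment(self):
        current = self.count       # read
        self.count = current + 1   # write; another thread may have written in between

    def decrement(self):
        current = self.count
        self.count = current - 1


class SafeConnectionCounter(ConnectionCounter):
    """One possible fix: serialize every update with a lock."""

    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            super().increment()

    def decrement(self):
        with self._lock:
            super().decrement()
```

Called sequentially, both classes behave identically, which is consistent with all 8 sequential tests passing on the buggy code.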

#### [NEW] [registry.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/registry.py)

- Maps `"easy"` / `"medium"` / `"hard"` → task config dict (buggy_code, test_suite, description, ground_truth, max_attempts, max_steps)

#### [NEW] [`__init__.py` files](file:///Users/shashaankjain/Desktop/meta_hackathon/env/__init__.py)

- `env/__init__.py` and `env/tasks/__init__.py`

---

### Step 4: Graders

#### [NEW] [base_grader.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/base_grader.py)

- Abstract base class with `score()` method

#### [NEW] [grader_easy.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_easy.py)

- Standard formula: 0.60 test_pass_ratio + 0.20 efficiency + 0.15 hypothesis + 0.05 early_solve
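
The weighted sum can be sketched as below. How each component is computed is defined in the README; this sketch only assumes that each one arrives already normalized to [0, 1], and the clamping and rounding are illustrative choices.

```python
def _clamp(value: float) -> float:
    """Restrict a value to the closed interval [0.0, 1.0]."""
    return min(max(float(value), 0.0), 1.0)


def standard_score(test_pass_ratio: float, efficiency: float,
                   hypothesis: float, early_solve: float) -> float:
    """Weighted sum with the plan's weights: 0.60 / 0.20 / 0.15 / 0.05."""
    total = (0.60 * _clamp(test_pass_ratio)
             + 0.20 * _clamp(efficiency)
             + 0.15 * _clamp(hypothesis)
             + 0.05 * _clamp(early_solve))
    # Rounding guards against float drift pushing the total past 1.0.
    return round(_clamp(total), 4)


print(standard_score(1.0, 1.0, 1.0, 1.0))
print(standard_score(0.875, 0.0, 0.0, 0.0))
```

Because the weights sum to 1.0 and every input is clamped, the result stays in [0.0, 1.0], and the function is deterministic for a given input, which is what the grader tests check.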

#### [NEW] [grader_medium.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_medium.py)

- Same formula but with red herring detection: hypothesis mentioning only "authenticate_user" scores 0.0
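
A sketch of the hypothesis component for the medium task, combining the red herring rule with the task's requirement that the hypothesis mention `hash_password` plus at least one other keyword. The keyword tuple is hypothetical, and treating a bare `hash_password` mention as 0.0 is an assumption.

```python
# Hypothetical keyword list; the real one lives in the task's ground truth.
OTHER_KEYWORDS = ("bytes", "encode", "prefix", "b'")


def hypothesis_score_medium(hypothesis: str) -> float:
    """Return 1.0 for a well-targeted hypothesis, 0.0 otherwise."""
    text = hypothesis.lower()
    if "hash_password" not in text:
        # Covers the red herring: a hypothesis that blames only
        # authenticate_user never names the real culprit.
        return 0.0
    matched = sum(1 for keyword in OTHER_KEYWORDS if keyword in text)
    return 1.0 if matched >= 1 else 0.0


print(hypothesis_score_medium("authenticate_user compares the wrong field"))
print(hypothesis_score_medium("hash_password wraps bytes in str()"))
```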

#### [NEW] [grader_hard.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_hard.py)

- Custom weights: 0.40 original tests + 0.30 concurrent stress test + 0.20 hypothesis + 0.10 efficiency
- Runs a 1000-thread concurrent stress test against submitted code
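
A rough sketch of the stress test and the custom weighting. It runs in-process for brevity, whereas the planned grader would presumably go through the sandbox; `LockedCounter`, `stress_test`, and `hard_score` are illustrative names, and scoring the stress test as pass/fail is an assumption.

```python
import threading


class LockedCounter:
    """Stand-in for a correctly fixed submission."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.count += 1


def stress_test(counter_factory, n_threads: int = 1000) -> bool:
    """Start n_threads threads that each increment once; check the total."""
    counter = counter_factory()
    threads = [threading.Thread(target=counter.increment) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.count == n_threads


def hard_score(original_ratio: float, stress_passed: bool,
               hypothesis: float, efficiency: float) -> float:
    """Weights from the plan: 0.40 / 0.30 / 0.20 / 0.10."""
    total = (0.40 * original_ratio + 0.30 * (1.0 if stress_passed else 0.0)
             + 0.20 * hypothesis + 0.10 * efficiency)
    return round(min(max(total, 0.0), 1.0), 4)
```

A single increment per thread may not reliably expose a race on every run, so a real stress test might perform many operations per thread to keep the grader deterministic in practice.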

#### [NEW] [test_graders.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_graders.py)

- Determinism tests (same input → same output)
- Range tests (output always in [0.0, 1.0])

---

### Step 5: Environment Core

#### [NEW] [environment.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/environment.py)

- `DebuggerEnvironment` class with `reset(task_id)`, `step(action)`, `state()` methods
- `reset()`: loads task, runs buggy code through sandbox to get initial error output
- `step()`: routes by `action_type`: `submit_fix` → sandbox, `query_context` → return info, `give_up` → run grader
- All action rules from Section 3.2 implemented exactly
- Step-level reward calculation per Section 6.1
- Episode-level grader invocation on `done=True`
- Never crashes — all errors returned in `info["error"]`
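
The routing and the never-crash rule can be sketched as follows. The sandbox and grader are injected as plain callables to keep the example self-contained, and the action fields `fixed_code` and `query` are hypothetical; the real field names come from README Section 3.

```python
from typing import Callable


class DebuggerEnvironment:
    """Minimal routing sketch; not the full reset/step/state implementation."""

    def __init__(self, run_tests: Callable[[str], str], grade: Callable[[], float]):
        self._run_tests = run_tests   # stand-in for the sandbox
        self._grade = grade           # stand-in for the episode-level grader
        self.done = False

    def step(self, action: dict) -> dict:
        observation, reward, info = "", 0.0, {}
        try:
            if self.done:
                raise RuntimeError("episode is finished; call reset()")
            kind = action.get("action_type")
            if kind == "submit_fix":
                observation = self._run_tests(action["fixed_code"])
            elif kind == "query_context":
                observation = "context for: " + str(action.get("query", ""))
            elif kind == "give_up":
                self.done = True
                reward = self._grade()
            else:
                raise ValueError(f"unknown action_type: {kind!r}")
        except Exception as exc:
            # Never crash: report the failure to the caller instead.
            info["error"] = f"{type(exc).__name__}: {exc}"
        return {"observation": observation, "reward": reward,
                "done": self.done, "info": info}
```

Catching exceptions at the top of `step()` means a malformed action costs the agent a step but never takes the server down.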

#### [NEW] [test_environment.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_environment.py)

- Unit tests for reset/step/state

---

### Step 6: FastAPI Server

#### [NEW] [server.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/server.py)

- `POST /reset`: body `{"task_id": "easy"}`, returns Observation JSON
- `POST /step`: body is Action JSON, returns `{"observation", "reward", "done", "info"}`
- `GET /state`: returns full state dict
- `GET /health`: returns `{"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}` with HTTP 200

---

### Step 7: Inference Script

#### [NEW] [inference.py](file:///Users/shashaankjain/Desktop/meta_hackathon/inference.py)

- Exact code from README Section 8, which is already fully specified
- Root directory placement (not in `env/`)
- Reads env vars: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, `ENV_BASE_URL`
- Uses `openai` Python client
- Saves `baseline_results.json`
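
The README's Section 8 code is authoritative for this script. Purely as an illustration of the environment-variable handling, a sketch might look like the following; which variables are required and the default for `ENV_BASE_URL` are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: str
    env_base_url: str


def load_config(environ=os.environ) -> InferenceConfig:
    """Read the four variables named in the plan from a mapping."""
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return InferenceConfig(
        api_base_url=environ["API_BASE_URL"],
        model_name=environ["MODEL_NAME"],
        hf_token=environ["HF_TOKEN"],
        # Assumed default, matching the local uvicorn port used in verification.
        env_base_url=environ.get("ENV_BASE_URL", "http://localhost:8000"),
    )
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.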

---

### Step 8: Configuration & Deployment

#### [NEW] [openenv.yaml](file:///Users/shashaankjain/Desktop/meta_hackathon/openenv.yaml)

- Exact content from README Section 9

#### [NEW] [Dockerfile](file:///Users/shashaankjain/Desktop/meta_hackathon/Dockerfile)

- Exact content from README Section 10

#### [NEW] [requirements.txt](file:///Users/shashaankjain/Desktop/meta_hackathon/requirements.txt)

- Exact content from README Section 11

---

## Open Questions

> [!IMPORTANT]
> **Task Medium (The Hash Bug):** The README describes a bytes/str conversion bug in `hash_password`, where wrapping the bytes in `str()` adds a `b'` prefix. I need to carefully design the `user_db` and test setup so that 6 tests pass and exactly 4 fail. The README leaves the exact test suite design for medium to the implementer. I'll design it to match the described behavior. Any preferences?

> [!IMPORTANT]
> **Hard Task Test Count:** The README sets `tests_total: 8` for hard in `openenv.yaml`. Those are the 8 sequential tests, all of which pass on the buggy code, so the agent needs to design its own concurrent test. The grader independently runs its own 1000-thread stress test. I'll keep `tests_total: 8` as the initial suite, and the grader adds its own concurrent verification separately. Correct?

## Verification Plan

### Automated Tests

1. `pytest tests/test_sandbox.py -v`: all 5 sandbox tests pass
2. `pytest tests/test_graders.py -v`: determinism and range tests pass
3. `pytest tests/test_environment.py -v`: reset/step/state tests pass
4. Start server with `uvicorn env.server:app --port 8000`, then:
- `curl http://localhost:8000/health` → 200 with correct JSON
- POST `/reset` for each task → valid Observation
- POST `/step` with various actions → correct responses
5. Variance self-check:
- Dummy agent (submits `pass`) → scores < 0.15
- Perfect agent (ground truth fix + correct hypothesis) → scores > 0.85 on easy
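
The server checks in step 4 could be scripted instead of run by hand. The sketch below uses only the standard library; the helper names are illustrative, and it assumes the server is listening on `http://localhost:8000` as in the `uvicorn` command.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # matches the uvicorn port in step 4


def build_request(method: str, path: str, body=None, base_url: str = BASE_URL):
    """Create a JSON request object without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(base_url + path, data=data,
                                  headers=headers, method=method)


def call(request, timeout: float = 5.0):
    """Send a request and return (status_code, parsed_json)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, json.loads(response.read().decode("utf-8"))


def smoke_requests():
    """Health check plus one /reset per task."""
    requests = [build_request("GET", "/health")]
    for task_id in ("easy", "medium", "hard"):
        requests.append(build_request("POST", "/reset", {"task_id": task_id}))
    return requests
```

With the server running, `for req in smoke_requests(): print(call(req))` exercises the health and reset endpoints in one pass.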

### Manual Verification

- Docker build: `docker build -t agentdebugger-env .`
- Docker run and health check
- User deploys to HuggingFace Space and runs `openenv validate .`