shank committed
Commit e766743 · Parent: 865526d

deleted implementation plan

Files changed (1):
  1. implementation_plan.md (+0 −187, deleted)
# AgentDebuggerEnv — Implementation Plan

An OpenEnv-compliant debugging environment where AI agents fix broken code through iterative hypothesis-test-fix cycles. Submission for the **Meta + PyTorch + HuggingFace OpenEnv Hackathon**.

## User Review Required

> [!IMPORTANT]
> This is a large project with **15+ files** to create. The entire codebase needs to be built from scratch (only the README exists currently). Please confirm you'd like me to proceed with the full implementation.

> [!WARNING]
> The README specifies `huggingface_space: shashaank/agentdebugger-env`. You'll need to create this HuggingFace Space and deploy the Docker container there for the hackathon submission. I'll build everything locally; deployment is a manual step.

## Proposed Changes

The implementation follows the exact order from the README's Section 14 checklist. Each step depends on the previous one.

---

### Step 1: Sandbox (`env/sandbox.py`) — Build & Test First

This is the most security-critical component. Every code execution goes through here.

#### [NEW] [sandbox.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/sandbox.py)

- `execute_code(code, test_code, allow_threading=False) → (str, bool, int)`
- AST-based import detection (not string matching) to block dangerous imports
- `BLOCKED_IMPORTS` list: os, sys, subprocess, socket, importlib, shutil, pathlib, glob, pickle, shelve, dbm, sqlite3, ftplib, http, urllib, requests, httpx, asyncio, multiprocessing, threading (unless `allow_threading=True`), ctypes, cffi, resource, signal, mmap, gc
- Write code + test_code to a temp file, run in a subprocess with `timeout=10`
- Capture merged stdout+stderr
- Clean up temp files in a `finally` block
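The bullets above can be sketched roughly as follows. This is a minimal illustration, not the real `sandbox.py`: the blocked-import list is truncated to a few entries, and the exact error strings and return conventions are assumptions.

```python
import ast
import os
import subprocess
import sys
import tempfile

# Illustrative subset of the plan's BLOCKED_IMPORTS list.
BLOCKED_IMPORTS = {"os", "sys", "subprocess", "socket", "importlib", "shutil"}


def find_blocked_imports(code: str, allow_threading: bool = False) -> list:
    """Walk the AST (not string matching) and collect blocked module roots."""
    blocked = set(BLOCKED_IMPORTS)
    if not allow_threading:
        blocked.add("threading")
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return []  # let the subprocess surface the syntax error as output
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            hits += [a.name.split(".")[0] for a in node.names
                     if a.name.split(".")[0] in blocked]
        elif isinstance(node, ast.ImportFrom) and node.module:
            root = node.module.split(".")[0]
            if root in blocked:
                hits.append(root)
    return hits


def execute_code(code: str, test_code: str, allow_threading: bool = False):
    """Run code + tests in a subprocess; return (output, passed, returncode)."""
    hits = find_blocked_imports(code, allow_threading)
    if hits:
        return f"blocked imports: {hits}", False, 1
    path = None
    try:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code + "\n\n" + test_code)
            path = f.name
        proc = subprocess.run(
            [sys.executable, path],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merged stdout+stderr, per the plan
            text=True,
            timeout=10,
        )
        return proc.stdout, proc.returncode == 0, proc.returncode
    except subprocess.TimeoutExpired:
        return "timeout after 10s", False, -1
    finally:
        if path is not None:
            os.unlink(path)  # clean up the temp file
```

The real implementation would carry the full blocked list and likely stricter subprocess isolation; this shows the AST-walk shape and the temp-file/`finally` lifecycle.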
#### [NEW] [test_sandbox.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_sandbox.py)

- 5 required tests: timeout, `os` blocked, `sys` blocked, clean code runs, syntax error returns output

---

### Step 2: Data Models

#### [NEW] [models.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/models.py)

- `FixAttempt`, `Observation`, `Action`, `Reward` — all Pydantic v2 `BaseModel` subclasses
- Exact field names and types from README Section 3
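A sketch of what the Pydantic v2 models could look like. The field names below are illustrative placeholders only; the binding definitions live in README Section 3.

```python
from pydantic import BaseModel, Field

# Hypothetical field names -- README Section 3 defines the real ones.
class Action(BaseModel):
    action_type: str   # e.g. "submit_fix" | "query_context" | "give_up"
    payload: str = ""


class FixAttempt(BaseModel):
    code: str
    passed: bool = False


class Observation(BaseModel):
    output: str
    tests_passed: int = 0
    tests_total: int = 0


class Reward(BaseModel):
    # Pydantic enforces the [0.0, 1.0] range at validation time.
    value: float = Field(0.0, ge=0.0, le=1.0)
```

Using Pydantic v2 means the FastAPI server in Step 6 gets request validation and JSON serialization (`model_dump()`) for free.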
---

### Step 3: Task Definitions

#### [NEW] [task_easy.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_easy.py)

- Binary search with `<` instead of `<=` bug
- 8-test suite: 7 pass initially, 1 fails (last element)
- Ground truth `hypothesis_keywords`: ["left <= right", "termination", "last element", "off by one", "<="]
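One possible shape of the planned bug (the exact test suite is left to the implementer):

```python
def binary_search_buggy(arr, target):
    left, right = 0, len(arr) - 1
    while left < right:              # bug: should be `left <= right`
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1


def binary_search_fixed(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:             # fix: the loop must also run when left == right
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1
```

With `<`, the loop exits as soon as the window shrinks to a single element, so any target that converges to `left == right` (the last element being the canonical case) is reported missing.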
#### [NEW] [task_medium.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_medium.py)

- `hash_password`, `validate_password`, `authenticate_user` — the bug is in `hash_password`
- 10-test suite: 6 pass, 4 fail (edge cases with hash mismatch)
- Red herring: the error points to `authenticate_user`, but the bug is in `hash_password`
- Hypothesis must mention "hash_password" AND at least 1 other keyword

#### [NEW] [task_hard.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/task_hard.py)

- `ConnectionCounter` with a race condition in `increment()`/`decrement()`
- 8 sequential tests, all of which pass on the buggy code
- The bug only surfaces under concurrent access
- `allow_threading=True` for this task
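A sketch of the planned class and a possible fix. The unlocked read-modify-write is the race; the exact method bodies in the real task may differ.

```python
import threading


class ConnectionCounter:
    """Buggy version: unsynchronized read-modify-write."""

    def __init__(self):
        self.count = 0

    def increment(self):
        current = self.count      # race: another thread can interleave here
        current += 1
        self.count = current

    def decrement(self):
        current = self.count
        current -= 1
        self.count = current


class SafeConnectionCounter:
    """One possible fix: guard the read-modify-write with a lock."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.count += 1

    def decrement(self):
        with self._lock:
            self.count -= 1
```

Sequential tests cannot distinguish the two classes, which is exactly why the 8-test initial suite passes on the buggy code.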
#### [NEW] [registry.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/tasks/registry.py)

- Maps `"easy"` / `"medium"` / `"hard"` → task config dict (buggy_code, test_suite, description, ground_truth, max_attempts, max_steps)
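A hypothetical registry shape, showing only the dict schema from the bullet above. The `"..."` strings stand in for the real task modules, and the `max_attempts`/`max_steps` values are assumed, not taken from the README.

```python
# Keys follow the plan; values here are placeholders / assumed numbers.
TASKS = {
    "easy": {
        "buggy_code": "...",       # supplied by task_easy.py
        "test_suite": "...",
        "description": "Binary search off-by-one",
        "ground_truth": {"hypothesis_keywords": ["left <= right", "<="]},
        "max_attempts": 5,          # assumed limit
        "max_steps": 20,            # assumed limit
    },
    # "medium" and "hard" follow the same schema
}


def get_task(task_id: str) -> dict:
    """Look up a task config; unknown ids raise rather than crash later."""
    if task_id not in TASKS:
        raise KeyError(f"unknown task_id: {task_id!r}")
    return TASKS[task_id]
```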
#### [NEW] [`__init__.py` files](file:///Users/shashaankjain/Desktop/meta_hackathon/env/__init__.py)

- `env/__init__.py` and `env/tasks/__init__.py`

---

### Step 4: Graders

#### [NEW] [base_grader.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/base_grader.py)

- Abstract base class with a `score()` method

#### [NEW] [grader_easy.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_easy.py)

- Standard formula: 0.60 test_pass_ratio + 0.20 efficiency + 0.15 hypothesis + 0.05 early_solve
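The weights are fixed by the plan; the definitions of `efficiency` and `hypothesis_score` below are assumptions for the sake of a runnable sketch.

```python
def score_easy(tests_passed: int, tests_total: int, steps_used: int,
               max_steps: int, hypothesis_score: float, solved_early: bool) -> float:
    """Standard weighting from the plan: 0.60 / 0.20 / 0.15 / 0.05.

    efficiency and hypothesis_score definitions are assumed here.
    """
    test_pass_ratio = tests_passed / tests_total
    efficiency = 1.0 - (steps_used / max_steps)   # assumed: linear step penalty
    early_solve = 1.0 if solved_early else 0.0
    score = (0.60 * test_pass_ratio
             + 0.20 * efficiency
             + 0.15 * hypothesis_score
             + 0.05 * early_solve)
    return max(0.0, min(1.0, score))  # clamp to [0.0, 1.0] for the range tests
```

The pure-function shape is what makes the Step 4 determinism and range tests in `test_graders.py` straightforward.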
#### [NEW] [grader_medium.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_medium.py)

- Same formula, but with red-herring detection: a hypothesis mentioning only "authenticate_user" scores 0.0

#### [NEW] [grader_hard.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/graders/grader_hard.py)

- Custom weights: 0.40 original tests + 0.30 concurrent stress test + 0.20 hypothesis + 0.10 efficiency
- Runs a 1000-thread concurrent stress test against the submitted code
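The stress check could be sketched like this. The 1000-thread count matches the plan; the per-thread iteration count and pass criterion are assumptions.

```python
import threading


def concurrent_stress_passes(counter_cls, n_threads: int = 1000,
                             iters: int = 100) -> bool:
    """Hammer increment() from many threads; pass iff no update was lost."""
    counter = counter_cls()

    def worker():
        for _ in range(iters):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # A correctly synchronized counter ends at exactly n_threads * iters.
    return counter.count == n_threads * iters
```

In the real grader this boolean would feed the 0.30 concurrent-stress-test term of the weighted score.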
#### [NEW] [test_graders.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_graders.py)

- Determinism tests (same input → same output)
- Range tests (output always in [0.0, 1.0])

---

### Step 5: Environment Core

#### [NEW] [environment.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/environment.py)

- `DebuggerEnvironment` class with `reset(task_id)`, `step(action)`, and `state()` methods
- `reset()`: loads the task, runs the buggy code through the sandbox to get the initial error output
- `step()`: routes by `action_type` — submit_fix → sandbox, query_context → return info, give_up → run grader
- All action rules from Section 3.2 implemented exactly
- Step-level reward calculation per Section 6.1
- Episode-level grader invocation on `done=True`
- Never crashes — all errors are returned in `info["error"]`
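A skeleton of the routing and never-crash contract described above. The sandbox and grader calls are stubbed out, and the observation payloads are placeholders.

```python
class DebuggerEnvironment:
    """Sketch: reset/step/state shape with action routing stubbed."""

    def __init__(self):
        self._task_id = None
        self._steps = 0
        self._done = False

    def reset(self, task_id: str) -> dict:
        self._task_id = task_id
        self._steps = 0
        self._done = False
        # Real version: load the task and run the buggy code through the
        # sandbox to produce the initial failing output.
        return {"observation": f"initial failing output for {task_id}"}

    def step(self, action: dict) -> dict:
        self._steps += 1
        try:
            kind = action["action_type"]
            if kind == "submit_fix":
                info = {"routed_to": "sandbox"}
            elif kind == "query_context":
                info = {"routed_to": "task_info"}
            elif kind == "give_up":
                self._done = True
                info = {"routed_to": "grader"}
            else:
                raise ValueError(f"unknown action_type: {kind}")
            return {"observation": {}, "reward": 0.0,
                    "done": self._done, "info": info}
        except Exception as exc:
            # Never crash: surface every error in info["error"].
            return {"observation": {}, "reward": 0.0, "done": False,
                    "info": {"error": str(exc)}}

    def state(self) -> dict:
        return {"task_id": self._task_id, "steps": self._steps,
                "done": self._done}
```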
#### [NEW] [test_environment.py](file:///Users/shashaankjain/Desktop/meta_hackathon/tests/test_environment.py)

- Unit tests for reset/step/state

---

### Step 6: FastAPI Server

#### [NEW] [server.py](file:///Users/shashaankjain/Desktop/meta_hackathon/env/server.py)

- `POST /reset` — body: `{"task_id": "easy"}`, returns Observation JSON
- `POST /step` — body: Action JSON, returns `{"observation", "reward", "done", "info"}`
- `GET /state` — returns the full state dict
- `GET /health` — returns `{"status": "ok", "environment": "agentdebugger-env", "version": "1.0.0"}` with HTTP 200
---

### Step 7: Inference Script

#### [NEW] [inference.py](file:///Users/shashaankjain/Desktop/meta_hackathon/inference.py)

- Exact code from README Section 8 — already fully specified
- Placed in the root directory (not in `env/`)
- Reads env vars: `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, `ENV_BASE_URL`
- Uses the `openai` Python client
- Saves `baseline_results.json`

---

### Step 8: Configuration & Deployment

#### [NEW] [openenv.yaml](file:///Users/shashaankjain/Desktop/meta_hackathon/openenv.yaml)

- Exact content from README Section 9

#### [NEW] [Dockerfile](file:///Users/shashaankjain/Desktop/meta_hackathon/Dockerfile)

- Exact content from README Section 10

#### [NEW] [requirements.txt](file:///Users/shashaankjain/Desktop/meta_hackathon/requirements.txt)

- Exact content from README Section 11

---

## Open Questions

> [!IMPORTANT]
> **Task Medium — The Hash Bug:** The README describes a bytes/str conversion bug in `hash_password` where `str()` wrapping adds a `"b'"` prefix. I need to carefully design the `user_db` and test setup so that 6 tests pass and exactly 4 fail. The README leaves the exact test-suite design for medium to the implementer; I'll design it to match the described behavior. Any preferences?
> [!IMPORTANT]
> **Hard Task Test Count:** The README says `tests_total: 8` for hard in `openenv.yaml`, but the hard task has 8 sequential tests (all passing) and the agent needs to design a concurrent test. The grader independently runs its own 1000-thread stress test. I'll keep `tests_total: 8` for the initial suite and let the grader add its own concurrent verification separately. Correct?

## Verification Plan

### Automated Tests

1. `pytest tests/test_sandbox.py -v` — all 5 sandbox tests pass
2. `pytest tests/test_graders.py -v` — determinism and range tests pass
3. `pytest tests/test_environment.py -v` — reset/step/state tests pass
4. Start the server with `uvicorn env.server:app --port 8000`, then:
   - `curl http://localhost:8000/health` → 200 with the correct JSON
   - POST `/reset` for each task → valid Observation
   - POST `/step` with various actions → correct responses
5. Variance self-check:
   - Dummy agent (submits `pass`) → scores < 0.15
   - Perfect agent (ground-truth fix + correct hypothesis) → scores > 0.85 on easy

### Manual Verification

- Docker build: `docker build -t agentdebugger-env .`
- Docker run and health check
- User deploys to the HuggingFace Space and runs `openenv validate .`