File size: 24,257 Bytes
9e64e71 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 | # Blog Material — Raw Knowledge Dump
Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at `docs/blog-outline.md`, the draft at `docs/blog-post.md`.
## Training Run Summary
### Run progression (what each run taught us)
1. **Run 1**: SFT works, GRPO plateaus — no penalty for post-episode waste
2. **Run 2**: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
3. **Run 3**: Without KL penalty, GRPO drifts structural tokens (`<tool_response>` instead of `<tool_call>`)
4. **Run 4**: KL penalty + reference model = OOM on L4
5. **Run 5**: KL too conservative with single-turn SFT — model only calls describe, never queries
6. **Run 6**: Multi-turn SFT breakthrough — first successful training, reward -0.1→0.7
7. **Run 7**: Repeat penalty, stable training, multi-table weakness exposed
8. **Run 8**: Thinking mode helps error recovery, introduces `<think>assistant` degenerate loop, OOM crash
9. **Run 9**: v2 continued training confirms ceiling — more epochs don't help medium questions
### Key numbers
| Metric | Value |
|--------|-------|
| Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
| Hardware | Colab L4 (24GB VRAM) |
| SFT data | 120 multi-turn trajectories, assistant_only_loss |
| Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
| Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
| Total training | ~7.3h on L4 for both phases |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
| Parse rate | >98% (valid tool-call JSON) |
| Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
| Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
| Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |
## Concrete Episode Examples (for blog)
### Success: Simple aggregation (3 tool calls)
Question: "What is the total bonus given in all evaluations?"
1. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
2. `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
3. `answer("19500.0")` → correct
Reward: 1.15
### Success: Error recovery with JOIN
Question: "Which employee received the biggest bonus?"
1. `describe("employee")` → Employee_ID, Name, Age, City
2. `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
3. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
4. `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
5. `answer("Louis Deacon")` → correct
Reward: 1.13
### Success: Subquery with comparison
Question: "How many cars have greater accelerate than the one with the most horsepower?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
2. `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
3. `answer("39")` → correct
Reward: 1.15
### Failure: Column name hallucination
Question: "What are all the makers and models?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
2. `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
3. `describe("model_list")` → ModelId, Maker, Model
4. `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
5. `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)
Model found the right data but answer was truncated by the 20-row display limit.
### Failure: FK chain reasoning
Question: "Which city has most number of arriving flights?"
1. `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
2. `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
3. `answer("Albany")` → incorrect
Model counted airports per city, not flights per city. Needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.
### Failure: Invented table
Question: "What are all details of students who registered but did not attend any course?"
1. `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
2. `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
3. `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance
Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.
## Reward Architecture Details
### Three-layer structure
```
L1 Operational (every step):
+0.02 exec_ok
+0.01 new_info (unique SQL hash)
-0.03 repeat penalty
-0.02 step cost
L2 Progress (QUERY only):
Delta from previous binned progress × 0.15
Binned to {0, 0.25, 0.5, 0.75, 1.0}
L3 Terminal (ANSWER only):
+1.0 correct, 0.0 wrong
Per-step clip: [-0.10, 0.15]
```
### Why potential-based shaping matters
- Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves optimal policy
- Our delta progress IS potential-based with γ=1
- Cumulative caps are NOT potential-based (depend on trajectory history)
- Without this guarantee, agents learn to farm exploration rewards
### Anti-farming mechanisms
- Hard budget (15 steps)
- Step cost (-0.02)
- Repeat penalty (-0.03)
- Terminal dominance (1.0 vs ~0.3 max exploration)
- Per-step clip [-0.10, 0.15]
- Post-episode penalty (-0.3)
## Eval Results (N=50, 2026-04-11)
### Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
|--------|----------|------------|-----------|------------|-------------|----------------|
| zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
| 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
| 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
| GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
| GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |
### Previous run (Run A, same day, same seed)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
|--------|----------|------------|-----------|------------|----------------|
| zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
| 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
| 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
| GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
| GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |
### Run-to-run variation (important for blog)
v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is **within run-to-run variation** (~6-8pp swing). For the blog, report both as "~28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.
The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.
Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).
### Key observations from N=50 eval (with retry, 2 runs)
1. **~30% accuracy** for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
2. **Run-to-run variation is ~6-8pp** — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
3. **Base model parse failure loop** — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
4. **GRPO solves format compliance** — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid `<tool_call>` JSON.
5. **GRPO failure mode is SQL quality, not format** — episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
6. **Extra turns don't help base models** — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
7. **1-shot occasionally gets lucky** — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.
### v2 vs v1: similar accuracy, more parse failures — behavioral shift
Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:
- **v1 guesses more**: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
- **v2 gives up on hard questions**: v2 produces "Task complete." output after multiple failed queries instead of calling `answer()`, producing parse failures. v2 learned that some questions are unsolvable.
- **Neither is clearly better**: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.
The v2 parse failure pattern (from raw output):
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE Population < ...
[PARSE FAIL] raw: Task complete. ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete. ← repeats until budget
```
Compare v1 on the same type of question:
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE ...
[OK] ANSWER: European cities and their names are: 42 ← wrong, but at least calls answer()
```
This is a form of **calibrated uncertainty** — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.
### For the blog narrative
The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.
## Format Mismatch Discovery (F011)
### The three differences between eval and training
1. **role:tool vs role:user** — Qwen3 renders `role:"tool"` as `<|im_start|>user\n<tool_response>...</tool_response>`, `role:"user"` as `<|im_start|>user\nplain text`. Same role token, different content structure.
2. **Structured tool_calls vs raw text** — Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`, eval was using `{"role":"assistant", "content":"<tool_call>...</tool_call>"}`.
3. **No separator vs `\n\n`** — TRL appends `reset()` return directly to user message. Eval had `question\n\ntable_hint`.
### Impact
Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint).
After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.
### Lesson
Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.
## Multi-Turn SFT — Why It's Critical
### Per-turn SFT (broken)
- 347 examples, each one assistant turn
- ~50% were describe calls
- Model learned: "when asked a question, call describe"
- With KL penalty, model stayed anchored to this single-action policy
- Result: reward=0.00, all rollouts identical, advantage=0
### Multi-turn SFT (working)
- 120 examples, each a full describe→query→answer trajectory
- `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags)
- Model learned: the SEQUENCE describe→query→answer
- With KL penalty, model explores within the multi-turn strategy
- Result: reward climbs to 0.7 in Phase 1
### Template patch detail
Qwen3's chat template lacks `{% generation %}` tags needed by TRL for assistant_only_loss. We patch the template before SFT, restore original before GRPO (TRL does exact-match checks on template string in `add_response_schema()` and `get_training_chat_template()`).
## The 0.6B Capacity Ceiling
### What works at 0.6B
- Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
- Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
- WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
- Answer formatting: comma lists, pipe-delimited rows, `[]` for empty
- Error recovery: describe after SQL error, retry with correct column names
- `sample` tool usage (learned in Run 6, inconsistent later)
### What breaks at 0.6B
- FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
- Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL
- Ambiguous column resolution: joins with same column name in both tables
- Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
- "stadium without concert" pattern: NOT IN with JOIN to get names
- Aggregate + GROUP BY + HAVING chains on multi-table joins
### The hallucination pattern
The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.
## Thinking Mode Observations (Run 8)
### Benefits
- Reasons through SQL errors: "no such column: airport_code" → `<think>` block → tries `AirportCode`
- Empty `<think></think>` on easy questions — token-efficient, emergent behavior
- Multi-step join planning in think blocks
### New failure mode
~23% of rollouts: `<think>assistant<think>assistant...` repeating until token limit. Model fails to close `</think>` tag. Burns entire token budget with garbage.
### OOM risk
Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.
## Narrative Threads for Blog
### "The environment is the product"
From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.
### "Small model showing improvement proves more than large model with marginal gains"
A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.
### "Analysts don't write perfect queries from scratch"
The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.
### "Dense rewards need theory"
Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.
### "Multi-turn SFT teaches strategy, not actions"
The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.
## References for Blog
- Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
- DeepSeek-AI (2025). DeepSeek-R1.
- Shao et al. (2024). DeepSeek-Math: GRPO.
- Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
- Yu et al. (2018). Spider dataset.
- Li et al. (2023). BIRD benchmark.
- TIPS (2026). Turn-Level Information-Potential Reward Shaping.
- ToolRL (2025). Reward is All Tool Learning Needs.
- StepTool (2024). Step-grained RL for Tool Learning.
## Showcase Notebook Transcripts (for blog)
### Random agent episode (seed=7) — comedic failure
Question: "Count the number of paragraphs."
```
SAMPLE Paragraphs → reward=0.015
SAMPLE Documents → reward=0.015
DESCRIBE Documents → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Templates → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
QUERY SELECT * FROM "Templates" LIMIT 5 → reward=0.0625
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Paragraphs → reward=0.015
QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
QUERY SELECT * FROM "Documents" LIMIT 5 → reward=0.025
ANSWER 76 | 20 | Robbin CV | y | None → reward=0.000 (incorrect)
```
Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote `SELECT COUNT(*)`.
### Oracle agent episode (seed=0) — clean solve
Question: "List the id of students who registered some courses and the number of their registered courses?"
```
Step 1: DESCRIBE student_course_registrations
→ student_id INTEGER, course_id INTEGER, registration_date DATETIME
→ reward: +0.015
Step 2: DESCRIBE students
→ student_id INTEGER, student_details VARCHAR(255)
→ reward: +0.015
Step 3: QUERY
SELECT T1.student_id, count(*)
FROM students AS T1
JOIN student_course_registrations AS T2
ON T1.student_id = T2.student_id
GROUP BY T1.student_id
→ 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
→ reward: +0.150
Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
→ correct
→ reward: +1.000
```
Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.
### Baseline comparison (50 episodes each)
| Policy | Success Rate | Avg Reward | Avg Steps |
|--------|-------------|------------|-----------|
| Random | 0.0% | 0.247 | 15.0 |
| Oracle | 100.0% | 1.168 | 3.5 |
The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.
### Reward constants (from server/reward.py)
```
+0.02 successful execution (no errors)
+0.01 new information (unique query)
-0.02 step cost (every action)
-0.03 repeat penalty (duplicate SQL)
[-0.10, +0.15] per-step clipping range
+1.0 correct answer (terminal)
+0.0 wrong answer (terminal)
```
Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.
## Competition Context
### OpenEnv Challenge (our target)
- Sponsors: PyTorch/Meta, HuggingFace, Unsloth
- Prize: $10K HF credits
- Judging: primarily blog-based
- Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper
- Green Agent wrapper is an explicit judging criterion in the OpenEnv Challenge.
### Deliverables
1. Environment on HF Hub — **live** at https://huggingface.co/spaces/hjerpe/sql_env
(pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`)
2. Training notebooks/scripts on GitHub — `notebooks/train_grpo.ipynb`,
`notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb`
3. Blog on HuggingFace — `docs/blog-post-v1.md` (draft)
### TRL integration status (already done — do not re-research)
`training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory`
class: `reset()` + named tool methods `describe() / sample() / query() /
answer()` with docstrings TRL uses to build the tool schema. The notebook
passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL,
reward_funcs=[sql_env_reward_func])`. The adapter runs `SQLEnvironment`
**in-process** (not a WebSocket client to the HF Space) — intentional, because
training opens N parallel sessions and the Space defaults to 1.
### Competitive landscape
- **SQL Repair** (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
- **Calendar Gym** (Turing) — featured on HF blog, real-world framing + failure analysis
- **OpenSec** — cybersecurity with arXiv paper, adversarial evidence injection
- Our position: no interactive SQL *exploration* environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.
### What winning entries do
1. Stakes framing — "this matters in production"
2. Concrete failure analysis with numbers
3. Contrast (random vs trained vs oracle)
4. Real data, not toy puzzles
5. Non-obvious insights from training
## Green Agent Evaluator
### What it is
OpenEnv's standardized evaluation wrapper pattern. A `Policy` protocol with `evaluate(env, policy, n_episodes, seed)` that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.
### Implementation
- `evaluation/policies.py` — `Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
- `evaluation/oracle_policy.py` — `OraclePolicy` baseline (runs gold SQL)
- `tests/test_evaluation.py` — 17 tests, all passing (unit + integration)
### How it works
```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy
# Run 50 episodes with random policy
result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
# Run with trained policy (any class with select_action method)
result = evaluate(env, trained_policy, n_episodes=50, seed=42)
```
### Where it's used
- `notebooks/showcase_sqlenv.ipynb` — Random vs Oracle baseline comparison
- `notebooks/compare_methods.ipynb` — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`
### Key design choices
- **Error isolation**: one episode crashing doesn't kill the run — logged as `EpisodeResult(error=str(exc))`
- **Deterministic seeding**: `seed + episode_index` per episode for reproducibility
- **Protocol-based**: any class with `select_action(observation) -> action` works — no inheritance required
- **Aggregate + per-episode**: `EvaluationResult` has both summary metrics and full `episodes` list for drill-down
### For the blog
The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through `evaluate()`. The trained GRPO model is wrapped in `LLMToolCallingPolicy` (which implements the `Policy` protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.
## Files to Reference
| File | Relevance |
|------|-----------|
| `docs/blog-outline.md` | Section structure template |
| `docs/blog-post.md` | Current draft |
| `docs/design-docs/reward-shaping-research.md` | Reward theory + references |
| `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed |
| `vision/VISION.md` | Product vision, success metrics |
| `training/trl_adapter.py` | Environment adapter code |
| `notebooks/compare_methods.ipynb` | Eval notebook |
| `notebooks/train_grpo.ipynb` | Training notebook |
|