File size: 24,257 Bytes
9e64e71
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
# Blog Material — Raw Knowledge Dump

Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at `docs/blog-outline.md`, the draft at `docs/blog-post.md`.

## Training Run Summary

### Run progression (what each run taught us)
1. **Run 1**: SFT works, GRPO plateaus — no penalty for post-episode waste
2. **Run 2**: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
3. **Run 3**: Without KL penalty, GRPO drifts structural tokens (`<tool_response>` instead of `<tool_call>`)
4. **Run 4**: KL penalty + reference model = OOM on L4
5. **Run 5**: KL too conservative with single-turn SFT — model only calls describe, never queries
6. **Run 6**: Multi-turn SFT breakthrough — first successful training, reward -0.1→0.7
7. **Run 7**: Repeat penalty, stable training, multi-table weakness exposed
8. **Run 8**: Thinking mode helps error recovery, introduces `<think>assistant` degenerate loop, OOM crash
9. **Run 9**: v2 continued training confirms ceiling — more epochs don't help medium questions

### Key numbers
| Metric | Value |
|--------|-------|
| Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
| Hardware | Colab L4 (24GB VRAM) |
| SFT data | 120 multi-turn trajectories, assistant_only_loss |
| Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
| Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
| Total training | ~7.3h on L4 for both phases |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
| Parse rate | >98% (valid tool-call JSON) |
| Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
| Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
| Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |

## Concrete Episode Examples (for blog)

### Success: Simple aggregation (3 tool calls)
Question: "What is the total bonus given in all evaluations?"
1. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
2. `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
3. `answer("19500.0")` → correct
Reward: 1.15

### Success: Error recovery with JOIN
Question: "Which employee received the biggest bonus?"
1. `describe("employee")` → Employee_ID, Name, Age, City
2. `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
3. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
4. `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
5. `answer("Louis Deacon")` → correct
Reward: 1.13

### Success: Subquery with comparison
Question: "How many cars have greater accelerate than the one with the most horsepower?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
2. `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
3. `answer("39")` → correct
Reward: 1.15

### Failure: Column name hallucination
Question: "What are all the makers and models?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
2. `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
3. `describe("model_list")` → ModelId, Maker, Model
4. `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
5. `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)
Model found the right data but answer was truncated by the 20-row display limit.

### Failure: FK chain reasoning
Question: "Which city has most number of arriving flights?"
1. `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
2. `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
3. `answer("Albany")` → incorrect
Model counted airports per city, not flights per city. Needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.

### Failure: Invented table
Question: "What are all details of students who registered but did not attend any course?"
1. `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
2. `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
3. `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance
Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.

## Reward Architecture Details

### Three-layer structure
```
L1 Operational (every step):
  +0.02  exec_ok
  +0.01  new_info (unique SQL hash)
  -0.03  repeat penalty
  -0.02  step cost

L2 Progress (QUERY only):
  Delta from previous binned progress × 0.15
  Binned to {0, 0.25, 0.5, 0.75, 1.0}

L3 Terminal (ANSWER only):
  +1.0 correct, 0.0 wrong

Per-step clip: [-0.10, 0.15]
```

### Why potential-based shaping matters
- Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves optimal policy
- Our delta progress IS potential-based with γ=1
- Cumulative caps are NOT potential-based (depend on trajectory history)
- Without this guarantee, agents learn to farm exploration rewards

### Anti-farming mechanisms
- Hard budget (15 steps)
- Step cost (-0.02)
- Repeat penalty (-0.03)
- Terminal dominance (1.0 vs ~0.3 max exploration)
- Per-step clip [-0.10, 0.15]
- Post-episode penalty (-0.3)

## Eval Results (N=50, 2026-04-11)

### Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
|--------|----------|------------|-----------|------------|-------------|----------------|
| zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
| 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
| 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
| GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
| GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |

### Previous run (Run A, same day, same seed)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
|--------|----------|------------|-----------|------------|----------------|
| zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
| 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
| 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
| GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
| GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |

### Run-to-run variation (important for blog)
v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is **within run-to-run variation** (~6-8pp swing). For the blog, report both as "~28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.

The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.

Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).

### Key observations from N=50 eval (with retry, 2 runs)
1. **~30% accuracy** for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
2. **Run-to-run variation is ~6-8pp** — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
3. **Base model parse failure loop** — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
4. **GRPO solves format compliance** — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid `<tool_call>` JSON.
5. **GRPO failure mode is SQL quality, not format** — episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
6. **Extra turns don't help base models** — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
7. **1-shot occasionally gets lucky** — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.

### v2 vs v1: similar accuracy, more parse failures — behavioral shift
Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:

- **v1 guesses more**: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
- **v2 gives up on hard questions**: v2 produces "Task complete." output after multiple failed queries instead of calling `answer()`, producing parse failures. v2 learned that some questions are unsolvable.
- **Neither is clearly better**: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.

The v2 parse failure pattern (from raw output):
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE Population < ...
[PARSE FAIL] raw: Task complete.    ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete.    ← repeats until budget
```

Compare v1 on the same type of question:
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE ...
[OK] ANSWER: European cities and their names are: 42    ← wrong, but at least calls answer()
```

This is a form of **calibrated uncertainty** — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.

### For the blog narrative
The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.

## Format Mismatch Discovery (F011)

### The three differences between eval and training
1. **role:tool vs role:user** — Qwen3 renders `role:"tool"` as `<|im_start|>user\n<tool_response>...</tool_response>`, `role:"user"` as `<|im_start|>user\nplain text`. Same role token, different content structure.
2. **Structured tool_calls vs raw text** — Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`, eval was using `{"role":"assistant", "content":"<tool_call>...</tool_call>"}`.
3. **No separator vs `\n\n`** — TRL appends `reset()` return directly to user message. Eval had `question\n\ntable_hint`.

### Impact
Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint).
After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.

### Lesson
Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.

## Multi-Turn SFT — Why It's Critical

### Per-turn SFT (broken)
- 347 examples, each one assistant turn
- ~50% were describe calls
- Model learned: "when asked a question, call describe"
- With KL penalty, model stayed anchored to this single-action policy
- Result: reward=0.00, all rollouts identical, advantage=0

### Multi-turn SFT (working)
- 120 examples, each a full describe→query→answer trajectory
- `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags)
- Model learned: the SEQUENCE describe→query→answer
- With KL penalty, model explores within the multi-turn strategy
- Result: reward climbs to 0.7 in Phase 1

### Template patch detail
Qwen3's chat template lacks `{% generation %}` tags needed by TRL for assistant_only_loss. We patch the template before SFT, restore original before GRPO (TRL does exact-match checks on template string in `add_response_schema()` and `get_training_chat_template()`).

## The 0.6B Capacity Ceiling

### What works at 0.6B
- Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
- Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
- WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
- Answer formatting: comma lists, pipe-delimited rows, `[]` for empty
- Error recovery: describe after SQL error, retry with correct column names
- `sample` tool usage (learned in Run 6, inconsistent later)

### What breaks at 0.6B
- FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
- Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL
- Ambiguous column resolution: joins with same column name in both tables
- Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
- "stadium without concert" pattern: NOT IN with JOIN to get names
- Aggregate + GROUP BY + HAVING chains on multi-table joins

### The hallucination pattern
The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.

## Thinking Mode Observations (Run 8)

### Benefits
- Reasons through SQL errors: "no such column: airport_code" → `<think>` block → tries `AirportCode`
- Empty `<think></think>` on easy questions — token-efficient, emergent behavior
- Multi-step join planning in think blocks

### New failure mode
~23% of rollouts: `<think>assistant<think>assistant...` repeating until token limit. Model fails to close `</think>` tag. Burns entire token budget with garbage.

### OOM risk
Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.

## Narrative Threads for Blog

### "The environment is the product"
From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.

### "Small model showing improvement proves more than large model with marginal gains"
A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.

### "Analysts don't write perfect queries from scratch"
The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.

### "Dense rewards need theory"
Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.

### "Multi-turn SFT teaches strategy, not actions"
The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.

## References for Blog

- Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
- DeepSeek-AI (2025). DeepSeek-R1.
- Shao et al. (2024). DeepSeek-Math: GRPO.
- Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
- Yu et al. (2018). Spider dataset.
- Li et al. (2023). BIRD benchmark.
- TIPS (2026). Turn-Level Information-Potential Reward Shaping.
- ToolRL (2025). Reward is All Tool Learning Needs.
- StepTool (2024). Step-grained RL for Tool Learning.

## Showcase Notebook Transcripts (for blog)

### Random agent episode (seed=7) — comedic failure
Question: "Count the number of paragraphs."
```
  SAMPLE Paragraphs         → reward=0.015
  SAMPLE Documents           → reward=0.015
  DESCRIBE Documents         → reward=0.015
  SAMPLE Documents           → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Templates         → reward=0.015
  SAMPLE Documents           → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  QUERY SELECT * FROM "Templates" LIMIT 5  → reward=0.0625
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Paragraphs        → reward=0.015
  QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
  QUERY SELECT * FROM "Documents" LIMIT 5  → reward=0.025
  ANSWER 76 | 20 | Robbin CV | y | None   → reward=0.000 (incorrect)
```
Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote `SELECT COUNT(*)`.

### Oracle agent episode (seed=0) — clean solve
Question: "List the id of students who registered some courses and the number of their registered courses?"
```
  Step 1: DESCRIBE student_course_registrations
    → student_id INTEGER, course_id INTEGER, registration_date DATETIME
    → reward: +0.015

  Step 2: DESCRIBE students
    → student_id INTEGER, student_details VARCHAR(255)
    → reward: +0.015

  Step 3: QUERY
    SELECT T1.student_id, count(*)
    FROM students AS T1
    JOIN student_course_registrations AS T2
      ON T1.student_id = T2.student_id
    GROUP BY T1.student_id
    → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
    → reward: +0.150

  Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
    → correct
    → reward: +1.000
```
Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.

### Baseline comparison (50 episodes each)
| Policy | Success Rate | Avg Reward | Avg Steps |
|--------|-------------|------------|-----------|
| Random | 0.0% | 0.247 | 15.0 |
| Oracle | 100.0% | 1.168 | 3.5 |

The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.

### Reward constants (from server/reward.py)
```
+0.02   successful execution (no errors)
+0.01   new information (unique query)
-0.02   step cost (every action)
-0.03   repeat penalty (duplicate SQL)
[-0.10, +0.15]  per-step clipping range
+1.0    correct answer (terminal)
+0.0    wrong answer (terminal)
```
Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.

## Competition Context

### OpenEnv Challenge (our target)
- Sponsors: PyTorch/Meta, HuggingFace, Unsloth
- Prize: $10K HF credits
- Judging: primarily blog-based
- Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper
- Green Agent wrapper is an explicit judging criterion in the OpenEnv Challenge.

### Deliverables
1. Environment on HF Hub — **live** at https://huggingface.co/spaces/hjerpe/sql_env
   (pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`)
2. Training notebooks/scripts on GitHub — `notebooks/train_grpo.ipynb`,
   `notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb`
3. Blog on HuggingFace — `docs/blog-post-v1.md` (draft)

### TRL integration status (already done — do not re-research)
`training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory`
class: `reset()` + named tool methods `describe() / sample() / query() /
answer()` with docstrings TRL uses to build the tool schema. The notebook
passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL,
reward_funcs=[sql_env_reward_func])`. The adapter runs `SQLEnvironment`
**in-process** (not a WebSocket client to the HF Space) — intentional, because
training opens N parallel sessions and the Space defaults to 1.

### Competitive landscape
- **SQL Repair** (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
- **Calendar Gym** (Turing) — featured on HF blog, real-world framing + failure analysis
- **OpenSec** — cybersecurity with arXiv paper, adversarial evidence injection
- Our position: no interactive SQL *exploration* environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.

### What winning entries do
1. Stakes framing — "this matters in production"
2. Concrete failure analysis with numbers
3. Contrast (random vs trained vs oracle)
4. Real data, not toy puzzles
5. Non-obvious insights from training

## Green Agent Evaluator

### What it is
OpenEnv's standardized evaluation wrapper pattern. A `Policy` protocol with `evaluate(env, policy, n_episodes, seed)` that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.

### Implementation
- `evaluation/policies.py``Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
- `evaluation/oracle_policy.py``OraclePolicy` baseline (runs gold SQL)
- `tests/test_evaluation.py` — 17 tests, all passing (unit + integration)

### How it works
```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

# Run 50 episodes with random policy
result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")

# Run with trained policy (any class with select_action method)
result = evaluate(env, trained_policy, n_episodes=50, seed=42)
```

### Where it's used
- `notebooks/showcase_sqlenv.ipynb` — Random vs Oracle baseline comparison
- `notebooks/compare_methods.ipynb` — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`

### Key design choices
- **Error isolation**: one episode crashing doesn't kill the run — logged as `EpisodeResult(error=str(exc))`
- **Deterministic seeding**: `seed + episode_index` per episode for reproducibility
- **Protocol-based**: any class with `select_action(observation) -> action` works — no inheritance required
- **Aggregate + per-episode**: `EvaluationResult` has both summary metrics and full `episodes` list for drill-down

### For the blog
The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through `evaluate()`. The trained GRPO model is wrapped in `LLMToolCallingPolicy` (which implements the `Policy` protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.

## Files to Reference

| File | Relevance |
|------|-----------|
| `docs/blog-outline.md` | Section structure template |
| `docs/blog-post.md` | Current draft |
| `docs/design-docs/reward-shaping-research.md` | Reward theory + references |
| `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed |
| `vision/VISION.md` | Product vision, success metrics |
| `training/trl_adapter.py` | Environment adapter code |
| `notebooks/compare_methods.ipynb` | Eval notebook |
| `notebooks/train_grpo.ipynb` | Training notebook |