# Blog Material — Raw Knowledge Dump
Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at `docs/blog-outline.md`, the draft at `docs/blog-post.md`.
## Training Run Summary
### Run progression (what each run taught us)
1. **Run 1**: SFT works, GRPO plateaus — no penalty for post-episode waste
2. **Run 2**: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
3. **Run 3**: Without a KL penalty, GRPO lets structural tokens drift (`<tool_response>` instead of `<tool_call>`)
4. **Run 4**: KL penalty + reference model = OOM on L4
5. **Run 5**: KL too conservative with single-turn SFT — model only calls describe, never queries
6. **Run 6**: Multi-turn SFT breakthrough — first successful training, reward -0.1 → 0.7
7. **Run 7**: Repeat penalty, stable training, multi-table weakness exposed
8. **Run 8**: Thinking mode helps error recovery, introduces `<think>assistant` degenerate loop, OOM crash
9. **Run 9**: v2 continued training confirms ceiling — more epochs don't help medium questions
### Key numbers
| Metric | Value |
|--------|-------|
| Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
| Hardware | Colab L4 (24GB VRAM) |
| SFT data | 120 multi-turn trajectories, assistant_only_loss |
| Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
| Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
| Total training | ~7.3h on L4 for both phases |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
| Parse rate | >98% (valid tool-call JSON) |
| Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
| Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
| Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |
## Concrete Episode Examples (for blog)
### Success: Simple aggregation (3 tool calls)
Question: "What is the total bonus given in all evaluations?"
1. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
2. `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
3. `answer("19500.0")` → correct
Reward: 1.15
### Success: Error recovery with JOIN
Question: "Which employee received the biggest bonus?"
1. `describe("employee")` → Employee_ID, Name, Age, City
2. `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
3. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
4. `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
5. `answer("Louis Deacon")` → correct
Reward: 1.13
### Success: Subquery with comparison
Question: "How many cars have greater accelerate than the one with the most horsepower?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
2. `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
3. `answer("39")` → correct
Reward: 1.15
### Failure: Column name hallucination
Question: "What are all the makers and models?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
2. `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
3. `describe("model_list")` → ModelId, Maker, Model
4. `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
5. `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)
Model found the right data but answer was truncated by the 20-row display limit.
### Failure: FK chain reasoning
Question: "Which city has most number of arriving flights?"
1. `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
2. `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
3. `answer("Albany")` → incorrect
Model counted airports per city, not flights per city. Needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.
### Failure: Invented table
Question: "What are all details of students who registered but did not attend any course?"
1. `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
2. `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
3. `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance
Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.
## Reward Architecture Details
### Three-layer structure
```
L1 Operational (every step):
  +0.02 exec_ok
  +0.01 new_info (unique SQL hash)
  -0.03 repeat penalty
  -0.02 step cost
L2 Progress (QUERY only):
  Delta from previous binned progress × 0.15
  Binned to {0, 0.25, 0.5, 0.75, 1.0}
L3 Terminal (ANSWER only):
  +1.0 correct, 0.0 wrong
Per-step clip: [-0.10, 0.15]
```
### Why potential-based shaping matters
- Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves the optimal policy
- Our delta progress IS potential-based with γ=1 (see the sketch below)
- Cumulative caps are NOT potential-based (depend on trajectory history)
- Without this guarantee, agents learn to farm exploration rewards
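A minimal sketch of the L2 shaping term under these assumptions (function and constant names are illustrative, not the actual `server/reward.py` code):
```python
# Illustrative sketch of the potential-based progress term (gamma = 1).
# Names and the binning rule are assumptions, not the real implementation.

PROGRESS_BINS = (0.0, 0.25, 0.5, 0.75, 1.0)
PROGRESS_SCALE = 0.15

def bin_progress(raw: float) -> float:
    """Snap raw progress in [0, 1] to the nearest bin."""
    return min(PROGRESS_BINS, key=lambda b: abs(b - raw))

def l2_progress_reward(prev_progress: float, new_progress: float) -> float:
    """F(s, s') = Phi(s') - Phi(s): depends only on the two states, so by
    Ng et al. (1999) it leaves the optimal policy unchanged."""
    delta = bin_progress(new_progress) - bin_progress(prev_progress)
    return delta * PROGRESS_SCALE
```
Because the term telescopes over an episode, oscillating between the same progress levels nets zero; the only way to accumulate L2 reward is to actually move the binned progress up.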
### Anti-farming mechanisms
- Hard budget (15 steps)
- Step cost (-0.02)
- Repeat penalty (-0.03)
- Terminal dominance (1.0 vs ~0.3 max exploration)
- Per-step clip [-0.10, 0.15]
- Post-episode penalty (-0.3)
## Eval Results (N=50, 2026-04-11)
### Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
|--------|----------|------------|-----------|------------|-------------|----------------|
| zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
| 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
| 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
| GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
| GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |
### Previous run (Run A, same day, same seed)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
|--------|----------|------------|-----------|------------|----------------|
| zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
| 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
| 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
| GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
| GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |
### Run-to-run variation (important for blog)
v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is **within run-to-run variation** (~6-8pp swing). For the blog, report both as "~28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.
The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.
Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).
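A rough sketch of that retry behavior (hypothetical helper, not the actual eval-harness code; the "no-op" action shown is an assumption):
```python
# Hypothetical sketch of the "no-op DESCRIBE on parse failure" behavior.
# parse_tool_call() is assumed to return None when the raw output is not a
# valid <tool_call> JSON block.
def step_with_parse_retry(env, raw_output, parse_tool_call):
    action = parse_tool_call(raw_output)
    if action is None:
        # Substitute a harmless describe call so the episode continues and
        # base models get the same 15-step budget as trained ones.
        action = {"name": "describe", "arguments": {"table": ""}}
    return env.step(action)
```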
### Key observations from N=50 eval (with retry, 2 runs)
1. **~30% accuracy** for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
2. **Run-to-run variation is ~6-8pp** — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
3. **Base model parse failure loop** — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
4. **GRPO solves format compliance** — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid `<tool_call>` JSON.
5. **GRPO failure mode is SQL quality, not format** β€” episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
6. **Extra turns don't help base models** — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
7. **1-shot occasionally gets lucky** — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.
### v2 vs v1: similar accuracy, more parse failures — behavioral shift
Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:
- **v1 guesses more**: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
- **v2 gives up on hard questions**: v2 produces "Task complete." output after multiple failed queries instead of calling `answer()`, producing parse failures. v2 learned that some questions are unsolvable.
- **Neither is clearly better**: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.
The v2 parse failure pattern (from raw output):
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE Population < ...
[PARSE FAIL] raw: Task complete. ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete. ← repeats until budget
```
Compare v1 on the same type of question:
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE ...
[OK] ANSWER: European cities and their names are: 42 ← wrong, but at least calls answer()
```
This is a form of **calibrated uncertainty** — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.
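If that change were made, the terminal layer might look something like this (hypothetical values; the current reward pays 0.0 for any wrong answer):
```python
# Hypothetical terminal reward with a small format-compliance bonus,
# meant to counter the "Task complete." abstention incentive.
def terminal_reward(correct: bool, called_answer_tool: bool) -> float:
    if correct:
        return 1.0
    if called_answer_tool:   # wrong answer, but submitted via answer() in valid format
        return 0.1
    return 0.0               # never called answer() (abstained or parse failure)
```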
### For the blog narrative
The story is clear: GRPO teaches format compliance (17-28% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.
## Format Mismatch Discovery (F011)
### The three differences between eval and training
1. **role:tool vs role:user** — Qwen3 renders `role:"tool"` as `<|im_start|>user\n<tool_response>...</tool_response>`, and `role:"user"` as `<|im_start|>user\nplain text`. Same role token, different content structure.
2. **Structured tool_calls vs raw text** — Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`; eval was using `{"role":"assistant", "content":"<tool_call>...</tool_call>"}`.
3. **No separator vs `\n\n`** — TRL appends the `reset()` return directly to the user message; eval had `question\n\ntable_hint` (see the sketch below).
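A sketch of the aligned message shapes implied by those three points (the values are illustrative; only the structure matters):
```python
# Illustrative message shapes; the structure mirrors the three points above.

question = "What is the total bonus given in all evaluations?"
reset_observation = "Tables: employee, evaluation"   # illustrative reset() return

# (3) reset() output appended directly to the user turn, no "\n\n" separator
user_msg = {"role": "user", "content": question + reset_observation}

# (2) assistant tool calls as structured tool_calls, not raw <tool_call> text
assistant_msg = {
    "role": "assistant",
    "tool_calls": [
        {"function": {"name": "describe", "arguments": '{"table": "evaluation"}'}}
    ],
}

# (1) environment feedback as role "tool", which Qwen3's template renders
#     inside <tool_response>...</tool_response>
tool_msg = {"role": "tool", "content": "Columns: Employee_ID, Year_awarded, Bonus"}

messages = [user_msg, assistant_msg, tool_msg]
```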
### Impact
Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint).
After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.
### Lesson
Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.
## Multi-Turn SFT — Why It's Critical
### Per-turn SFT (broken)
- 347 examples, each one assistant turn
- ~50% were describe calls
- Model learned: "when asked a question, call describe"
- With KL penalty, model stayed anchored to this single-action policy
- Result: reward=0.00, all rollouts identical, advantage=0
### Multi-turn SFT (working)
- 120 examples, each a full describe→query→answer trajectory
- `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags)
- Model learned: the SEQUENCE describe→query→answer
- With KL penalty, model explores within the multi-turn strategy
- Result: reward climbs to 0.7 in Phase 1
### Template patch detail
Qwen3's chat template lacks `{% generation %}` tags needed by TRL for assistant_only_loss. We patch the template before SFT, restore original before GRPO (TRL does exact-match checks on template string in `add_response_schema()` and `get_training_chat_template()`).
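A minimal sketch of the patch/restore dance, assuming the tokenizer exposes its Jinja template as a plain string (the `replace()` anchor is an assumption about where Qwen3's template emits assistant content):
```python
# Sketch only: patch the chat template for assistant_only_loss, restore it before GRPO.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
original_template = tok.chat_template          # keep the exact string

patched = original_template.replace(
    "{{ content }}",                            # assumed anchor inside the template
    "{% generation %}{{ content }}{% endgeneration %}",
)
tok.chat_template = patched                     # in effect for the SFT run

# ... SFT with assistant_only_loss ...

tok.chat_template = original_template           # restore before GRPO, since TRL
                                                # exact-matches the stock template string
```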
## The 0.6B Capacity Ceiling
### What works at 0.6B
- Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
- Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
- WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
- Answer formatting: comma lists, pipe-delimited rows, `[]` for empty
- Error recovery: describe after SQL error, retry with correct column names
- `sample` tool usage (learned in Run 6, inconsistent later)
### What breaks at 0.6B
- FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
- Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL
- Ambiguous column resolution: joins with same column name in both tables
- Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
- "stadium without concert" pattern: NOT IN with JOIN to get names
- Aggregate + GROUP BY + HAVING chains on multi-table joins
### The hallucination pattern
The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.
## Thinking Mode Observations (Run 8)
### Benefits
- Reasons through SQL errors: "no such column: airport_code" → `<think>` block → tries `AirportCode`
- Empty `<think></think>` on easy questions — token-efficient, emergent behavior
- Multi-step join planning in think blocks
### New failure mode
~23% of rollouts: `<think>assistant<think>assistant...` repeating until token limit. Model fails to close `</think>` tag. Burns entire token budget with garbage.
### OOM risk
Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.
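As config changes, the two mitigations would look roughly like this (a sketch assuming TRL's `GRPOConfig` argument names; the notebook's actual settings may differ):
```python
# Sketch of the Phase-2 memory mitigations, assuming TRL's GRPOConfig names.
from trl import GRPOConfig

phase2_config = GRPOConfig(
    max_completion_length=1024,  # down from 1280, where the run OOMed at step 182/467
    num_generations=3,           # alternative: keep longer completions but drop 4 -> 3 rollouts
)
```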
## Narrative Threads for Blog
### "The environment is the product"
From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.
### "Small model showing improvement proves more than large model with marginal gains"
A 0.6B model going from 0% to ~30% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.
### "Analysts don't write perfect queries from scratch"
The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.
### "Dense rewards need theory"
Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.
### "Multi-turn SFT teaches strategy, not actions"
The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.
## References for Blog
- Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
- DeepSeek-AI (2025). DeepSeek-R1.
- Shao et al. (2024). DeepSeek-Math: GRPO.
- Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
- Yu et al. (2018). Spider dataset.
- Li et al. (2023). BIRD benchmark.
- TIPS (2026). Turn-Level Information-Potential Reward Shaping.
- ToolRL (2025). Reward is All Tool Learning Needs.
- StepTool (2024). Step-grained RL for Tool Learning.
## Showcase Notebook Transcripts (for blog)
### Random agent episode (seed=7) — comedic failure
Question: "Count the number of paragraphs."
```
SAMPLE Paragraphs → reward=0.015
SAMPLE Documents → reward=0.015
DESCRIBE Documents → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Templates → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
QUERY SELECT * FROM "Templates" LIMIT 5 → reward=0.0625
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Paragraphs → reward=0.015
QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
QUERY SELECT * FROM "Documents" LIMIT 5 → reward=0.025
ANSWER 76 | 20 | Robbin CV | y | None → reward=0.000 (incorrect)
```
Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote `SELECT COUNT(*)`.
### Oracle agent episode (seed=0) — clean solve
Question: "List the id of students who registered some courses and the number of their registered courses?"
```
Step 1: DESCRIBE student_course_registrations
→ student_id INTEGER, course_id INTEGER, registration_date DATETIME
→ reward: +0.015
Step 2: DESCRIBE students
→ student_id INTEGER, student_details VARCHAR(255)
→ reward: +0.015
Step 3: QUERY
SELECT T1.student_id, count(*)
FROM students AS T1
JOIN student_course_registrations AS T2
ON T1.student_id = T2.student_id
GROUP BY T1.student_id
→ 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
→ reward: +0.150
Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
→ correct
→ reward: +1.000
```
Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.
### Baseline comparison (50 episodes each)
| Policy | Success Rate | Avg Reward | Avg Steps |
|--------|-------------|------------|-----------|
| Random | 0.0% | 0.247 | 15.0 |
| Oracle | 100.0% | 1.168 | 3.5 |
The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.
### Reward constants (from server/reward.py)
```
+0.02 successful execution (no errors)
+0.01 new information (unique query)
-0.02 step cost (every action)
-0.03 repeat penalty (duplicate SQL)
[-0.10, +0.15] per-step clipping range
+1.0 correct answer (terminal)
+0.0 wrong answer (terminal)
```
Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.
## Competition Context
### OpenEnv Challenge (our target)
- Sponsors: PyTorch/Meta, HuggingFace, Unsloth
- Prize: $10K HF credits
- Judging: primarily blog-based
- Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper (the wrapper is an explicit judging criterion)
### Deliverables
1. Environment on HF Hub — **live** at https://huggingface.co/spaces/hjerpe/sql_env (pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`)
2. Training notebooks/scripts on GitHub — `notebooks/train_grpo.ipynb`, `notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb`
3. Blog on HuggingFace — `docs/blog-post-v1.md` (draft)
### TRL integration status (already done — do not re-research)
`training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory` class: `reset()` + named tool methods `describe() / sample() / query() / answer()` with docstrings TRL uses to build the tool schema. The notebook passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL, reward_funcs=[sql_env_reward_func])`. The adapter runs `SQLEnvironment` **in-process** (not a WebSocket client to the HF Space) — intentional, because training opens N parallel sessions and the Space defaults to 1.
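A condensed skeleton of what that adapter shape implies (method bodies, the `SQLEnvironment` call signatures, and the import path are assumptions; the real code is in `training/trl_adapter.py`):
```python
# Skeleton only; the SQLEnvironment method names and import path are assumptions.
from sql_env.server.environment import SQLEnvironment  # assumed import path

class SQLEnvTRL:
    """TRL-native environment_factory: reset() plus named tool methods whose
    docstrings become the tool schema the model sees."""

    def __init__(self):
        self._env = SQLEnvironment()   # in-process, not a WebSocket client to the Space

    def reset(self) -> str:
        return self._env.reset()       # question (+ table hint) for the first user turn

    def describe(self, table: str) -> str:
        """List the columns and types of a table."""
        return self._env.describe(table)

    def sample(self, table: str) -> str:
        """Return a few example rows from a table."""
        return self._env.sample(table)

    def query(self, sql: str) -> str:
        """Execute a read-only SQL query and return the result or the error."""
        return self._env.query(sql)

    def answer(self, final_answer: str) -> str:
        """Submit the final answer and end the episode."""
        return self._env.answer(final_answer)
```
The notebook passes the class itself (not an instance) as `environment_factory`, as noted above, so TRL can open one environment per rollout.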
### Competitive landscape
- **SQL Repair** (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
- **Calendar Gym** (Turing) — featured on HF blog, real-world framing + failure analysis
- **OpenSec** — cybersecurity with arXiv paper, adversarial evidence injection
- Our position: no interactive SQL *exploration* environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.
### What winning entries do
1. Stakes framing — "this matters in production"
2. Concrete failure analysis with numbers
3. Contrast (random vs trained vs oracle)
4. Real data, not toy puzzles
5. Non-obvious insights from training
## Green Agent Evaluator
### What it is
OpenEnv's standardized evaluation wrapper pattern: a `Policy` protocol plus an `evaluate(env, policy, n_episodes, seed)` harness that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.
### Implementation
- `evaluation/policies.py` — `Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
- `evaluation/oracle_policy.py` — `OraclePolicy` baseline (runs gold SQL)
- `tests/test_evaluation.py` — 17 tests, all passing (unit + integration)
### How it works
```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy
# Run 50 episodes with random policy
result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
# Run with trained policy (any class with select_action method)
result = evaluate(env, trained_policy, n_episodes=50, seed=42)
```
### Where it's used
- `notebooks/showcase_sqlenv.ipynb` — Random vs Oracle baseline comparison
- `notebooks/compare_methods.ipynb` — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`
### Key design choices
- **Error isolation**: one episode crashing doesn't kill the run — it is logged as `EpisodeResult(error=str(exc))`
- **Deterministic seeding**: `seed + episode_index` per episode for reproducibility
- **Protocol-based**: any class with `select_action(observation) -> action` works — no inheritance required (see the sketch below)
- **Aggregate + per-episode**: `EvaluationResult` has both summary metrics and the full `episodes` list for drill-down
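A sketch of what the protocol-based design means in practice (illustrative; the real `Policy` and `evaluate()` live in `evaluation/policies.py`):
```python
# Illustrative sketch of the structural-typing contract; not the real definitions.
from typing import Any, Protocol

class Policy(Protocol):
    def select_action(self, observation: Any) -> Any: ...

class AlwaysDescribePolicy:
    """No inheritance needed: having select_action() is enough to satisfy Policy."""
    def select_action(self, observation: Any) -> Any:
        return {"name": "describe", "arguments": {"table": "employee"}}

# Usable with the same harness as RandomPolicy / OraclePolicy:
# result = evaluate(env, AlwaysDescribePolicy(), n_episodes=10, seed=0)
```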
### For the blog
The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through `evaluate()`. The trained GRPO model is wrapped in `LLMToolCallingPolicy` (which implements the `Policy` protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.
## Files to Reference
| File | Relevance |
|------|-----------|
| `docs/blog-outline.md` | Section structure template |
| `docs/blog-post.md` | Current draft |
| `docs/design-docs/reward-shaping-research.md` | Reward theory + references |
| `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed |
| `vision/VISION.md` | Product vision, success metrics |
| `training/trl_adapter.py` | Environment adapter code |
| `notebooks/compare_methods.ipynb` | Eval notebook |
| `notebooks/train_grpo.ipynb` | Training notebook |