# Blog Material — Raw Knowledge Dump
Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at `docs/blog-outline.md`, the draft at `docs/blog-post.md`.
## Training Run Summary
### Run progression (what each run taught us)
1. **Run 1**: SFT works, GRPO plateaus — no penalty for post-episode waste
2. **Run 2**: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
3. **Run 3**: Without a KL penalty, GRPO lets structural tokens drift (`<tool_response>` instead of `<tool_call>`)
4. **Run 4**: KL penalty + reference model = OOM on L4
5. **Run 5**: KL too conservative with single-turn SFT — model only calls describe, never queries
6. **Run 6**: Multi-turn SFT breakthrough — first successful training, reward -0.1 → 0.7
7. **Run 7**: Repeat penalty, stable training, multi-table weakness exposed
8. **Run 8**: Thinking mode helps error recovery, introduces `<think>assistant` degenerate loop, OOM crash
9. **Run 9**: v2 continued training confirms ceiling — more epochs don't help medium questions
### Key numbers
| Metric | Value |
|--------|-------|
| Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
| Hardware | Colab L4 (24GB VRAM) |
| SFT data | 120 multi-turn trajectories, assistant_only_loss |
| Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
| Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
| Total training | ~7.3h on L4 for both phases |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
| Parse rate | >98% (valid tool-call JSON) |
| Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
| Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
| Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |
## Concrete Episode Examples (for blog)
### Success: Simple aggregation (3 tool calls)
Question: "What is the total bonus given in all evaluations?"
1. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
2. `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
3. `answer("19500.0")` → correct
Reward: 1.15
### Success: Error recovery with JOIN
Question: "Which employee received the biggest bonus?"
1. `describe("employee")` → Employee_ID, Name, Age, City
2. `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
3. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
4. `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
5. `answer("Louis Deacon")` → correct
Reward: 1.13
### Success: Subquery with comparison
Question: "How many cars have greater accelerate than the one with the most horsepower?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
2. `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
3. `answer("39")` → correct
Reward: 1.15
### Failure: Column name hallucination
Question: "What are all the makers and models?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
2. `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
3. `describe("model_list")` → ModelId, Maker, Model
4. `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
5. `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)
Model found the right data but answer was truncated by the 20-row display limit.
### Failure: FK chain reasoning
Question: "Which city has most number of arriving flights?"
1. `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
2. `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
3. `answer("Albany")` → incorrect
Model counted airports per city, not flights per city. Needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.
### Failure: Invented table
Question: "What are all details of students who registered but did not attend any course?"
1. `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
2. `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
3. `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance
Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.
## Reward Architecture Details
### Three-layer structure
```
L1 Operational (every step):
  +0.02 exec_ok
  +0.01 new_info (unique SQL hash)
  -0.03 repeat penalty
  -0.02 step cost
L2 Progress (QUERY only):
  Delta from previous binned progress × 0.15
  Binned to {0, 0.25, 0.5, 0.75, 1.0}
L3 Terminal (ANSWER only):
  +1.0 correct, 0.0 wrong
Per-step clip: [-0.10, 0.15]
```
### Why potential-based shaping matters
- Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves the optimal policy
- Our delta progress IS potential-based with γ=1 (see the sketch below)
- Cumulative caps are NOT potential-based (depend on trajectory history)
- Without this guarantee, agents learn to farm exploration rewards
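A minimal sketch of the L2 shaping term under these assumptions (function and constant names are illustrative, not the actual `server/reward.py` code):
```python
# Illustrative sketch of the potential-based progress term (gamma = 1).
# Names and the binning rule are assumptions, not the real implementation.

PROGRESS_BINS = (0.0, 0.25, 0.5, 0.75, 1.0)
PROGRESS_SCALE = 0.15

def bin_progress(raw: float) -> float:
    """Snap raw progress in [0, 1] to the nearest bin."""
    return min(PROGRESS_BINS, key=lambda b: abs(b - raw))

def l2_progress_reward(prev_progress: float, new_progress: float) -> float:
    """F(s, s') = Phi(s') - Phi(s): depends only on the two states, so by
    Ng et al. (1999) it leaves the optimal policy unchanged."""
    delta = bin_progress(new_progress) - bin_progress(prev_progress)
    return delta * PROGRESS_SCALE
```
Because the term telescopes over an episode, oscillating between the same progress levels nets zero; the only way to accumulate L2 reward is to actually move the binned progress up.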
### Anti-farming mechanisms
- Hard budget (15 steps)
- Step cost (-0.02)
- Repeat penalty (-0.03)
- Terminal dominance (1.0 vs ~0.3 max exploration)
- Per-step clip [-0.10, 0.15]
- Post-episode penalty (-0.3)
## Eval Results (N=50, 2026-04-11)
### Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
|--------|----------|------------|-----------|------------|-------------|----------------|
| zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
| 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
| 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
| GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
| GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |
### Previous run (Run A, same day, same seed)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
|--------|----------|------------|-----------|------------|----------------|
| zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
| 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
| 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
| GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
| GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |
### Run-to-run variation (important for blog)
v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is **within run-to-run variation** (~6-8pp swing). For the blog, report both as "~28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.
The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.
Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).
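A rough sketch of that retry behavior (hypothetical helper, not the actual eval-harness code; the "no-op" action shown is an assumption):
```python
# Hypothetical sketch of the "no-op DESCRIBE on parse failure" behavior.
# parse_tool_call() is assumed to return None when the raw output is not a
# valid <tool_call> JSON block.
def step_with_parse_retry(env, raw_output, parse_tool_call):
    action = parse_tool_call(raw_output)
    if action is None:
        # Substitute a harmless describe call so the episode continues and
        # base models get the same 15-step budget as trained ones.
        action = {"name": "describe", "arguments": {"table": ""}}
    return env.step(action)
```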
### Key observations from N=50 eval (with retry, 2 runs)
1. **~30% accuracy** for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
2. **Run-to-run variation is ~6-8pp** — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
3. **Base model parse failure loop** — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
4. **GRPO solves format compliance** — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid `<tool_call>` JSON.
5. **GRPO failure mode is SQL quality, not format** β€” episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
6. **Extra turns don't help base models** — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
7. **1-shot occasionally gets lucky** — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.
### v2 vs v1: similar accuracy, more parse failures — behavioral shift
Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:
- **v1 guesses more**: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
- **v2 gives up on hard questions**: v2 produces "Task complete." output after multiple failed queries instead of calling `answer()`, producing parse failures. v2 learned that some questions are unsolvable.
- **Neither is clearly better**: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.
The v2 parse failure pattern (from raw output):
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE Population < ...
[PARSE FAIL] raw: Task complete. ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete. ← repeats until budget
```
Compare v1 on the same type of question:
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE ...
[OK] ANSWER: European cities and their names are: 42 ← wrong, but at least calls answer()
```
This is a form of **calibrated uncertainty** — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.
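If that change were made, the terminal layer might look something like this (hypothetical values; the current reward pays 0.0 for any wrong answer):
```python
# Hypothetical terminal reward with a small format-compliance bonus,
# meant to counter the "Task complete." abstention incentive.
def terminal_reward(correct: bool, called_answer_tool: bool) -> float:
    if correct:
        return 1.0
    if called_answer_tool:   # wrong answer, but submitted via answer() in valid format
        return 0.1
    return 0.0               # never called answer() (abstained or parse failure)
```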
### For the blog narrative
The story is clear: GRPO teaches format compliance (17-28% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.
## Format Mismatch Discovery (F011)
### The three differences between eval and training
1. **role:tool vs role:user** — Qwen3 renders `role:"tool"` as `<|im_start|>user\n<tool_response>...</tool_response>`, and `role:"user"` as `<|im_start|>user\nplain text`. Same role token, different content structure.
2. **Structured tool_calls vs raw text** — Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`; eval was using `{"role":"assistant", "content":"<tool_call>...</tool_call>"}`.
3. **No separator vs `\n\n`** — TRL appends the `reset()` return directly to the user message; eval had `question\n\ntable_hint` (see the sketch below).
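A sketch of the aligned message shapes implied by those three points (the values are illustrative; only the structure matters):
```python
# Illustrative message shapes; the structure mirrors the three points above.

question = "What is the total bonus given in all evaluations?"
reset_observation = "Tables: employee, evaluation"   # illustrative reset() return

# (3) reset() output appended directly to the user turn, no "\n\n" separator
user_msg = {"role": "user", "content": question + reset_observation}

# (2) assistant tool calls as structured tool_calls, not raw <tool_call> text
assistant_msg = {
    "role": "assistant",
    "tool_calls": [
        {"function": {"name": "describe", "arguments": '{"table": "evaluation"}'}}
    ],
}

# (1) environment feedback as role "tool", which Qwen3's template renders
#     inside <tool_response>...</tool_response>
tool_msg = {"role": "tool", "content": "Columns: Employee_ID, Year_awarded, Bonus"}

messages = [user_msg, assistant_msg, tool_msg]
```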
### Impact
Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint).
After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.
### Lesson
Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.
## Multi-Turn SFT — Why It's Critical
### Per-turn SFT (broken)
- 347 examples, each one assistant turn
- ~50% were describe calls
- Model learned: "when asked a question, call describe"
- With KL penalty, model stayed anchored to this single-action policy
- Result: reward=0.00, all rollouts identical, advantage=0
### Multi-turn SFT (working)
- 120 examples, each a full describe→query→answer trajectory
- `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags)
- Model learned: the SEQUENCE describe→query→answer
- With KL penalty, model explores within the multi-turn strategy
- Result: reward climbs to 0.7 in Phase 1
### Template patch detail
Qwen3's chat template lacks `{% generation %}` tags needed by TRL for assistant_only_loss. We patch the template before SFT, restore original before GRPO (TRL does exact-match checks on template string in `add_response_schema()` and `get_training_chat_template()`).
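A minimal sketch of the patch/restore dance, assuming the tokenizer exposes its Jinja template as a plain string (the `replace()` anchor is an assumption about where Qwen3's template emits assistant content):
```python
# Sketch only: patch the chat template for assistant_only_loss, restore it before GRPO.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
original_template = tok.chat_template          # keep the exact string

patched = original_template.replace(
    "{{ content }}",                            # assumed anchor inside the template
    "{% generation %}{{ content }}{% endgeneration %}",
)
tok.chat_template = patched                     # in effect for the SFT run

# ... SFT with assistant_only_loss ...

tok.chat_template = original_template           # restore before GRPO, since TRL
                                                # exact-matches the stock template string
```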
## The 0.6B Capacity Ceiling
### What works at 0.6B
- Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
- Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
- WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
- Answer formatting: comma lists, pipe-delimited rows, `[]` for empty
- Error recovery: describe after SQL error, retry with correct column names
- `sample` tool usage (learned in Run 6, inconsistent later)
### What breaks at 0.6B
- FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
- Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL
- Ambiguous column resolution: joins with same column name in both tables
- Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
- "stadium without concert" pattern: NOT IN with JOIN to get names
- Aggregate + GROUP BY + HAVING chains on multi-table joins
### The hallucination pattern
The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.
## Thinking Mode Observations (Run 8)
### Benefits
- Reasons through SQL errors: "no such column: airport_code" → `<think>` block → tries `AirportCode`
- Empty `<think></think>` on easy questions — token-efficient, emergent behavior
- Multi-step join planning in think blocks
### New failure mode
~23% of rollouts: `<think>assistant<think>assistant...` repeating until token limit. Model fails to close `</think>` tag. Burns entire token budget with garbage.
### OOM risk
Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.
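As config changes, the two mitigations would look roughly like this (a sketch assuming TRL's `GRPOConfig` argument names; the notebook's actual settings may differ):
```python
# Sketch of the Phase-2 memory mitigations, assuming TRL's GRPOConfig names.
from trl import GRPOConfig

phase2_config = GRPOConfig(
    max_completion_length=1024,  # down from 1280, where the run OOMed at step 182/467
    num_generations=3,           # alternative: keep longer completions but drop 4 -> 3 rollouts
)
```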
## Narrative Threads for Blog
### "The environment is the product"
From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.
### "Small model showing improvement proves more than large model with marginal gains"
A 0.6B model going from 0% to ~30% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.
### "Analysts don't write perfect queries from scratch"
The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.
### "Dense rewards need theory"
Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.
### "Multi-turn SFT teaches strategy, not actions"
The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.
## References for Blog
- Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
- DeepSeek-AI (2025). DeepSeek-R1.
- Shao et al. (2024). DeepSeek-Math: GRPO.
- Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
- Yu et al. (2018). Spider dataset.
- Li et al. (2023). BIRD benchmark.
- TIPS (2026). Turn-Level Information-Potential Reward Shaping.
- ToolRL (2025). Reward is All Tool Learning Needs.
- StepTool (2024). Step-grained RL for Tool Learning.
## Showcase Notebook Transcripts (for blog)
### Random agent episode (seed=7) — comedic failure
Question: "Count the number of paragraphs."
```
SAMPLE Paragraphs → reward=0.015
SAMPLE Documents → reward=0.015
DESCRIBE Documents → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Templates → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
QUERY SELECT * FROM "Templates" LIMIT 5 → reward=0.0625
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Paragraphs → reward=0.015
QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
QUERY SELECT * FROM "Documents" LIMIT 5 → reward=0.025
ANSWER 76 | 20 | Robbin CV | y | None → reward=0.000 (incorrect)
```
Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote `SELECT COUNT(*)`.
### Oracle agent episode (seed=0) — clean solve
Question: "List the id of students who registered some courses and the number of their registered courses?"
```
Step 1: DESCRIBE student_course_registrations
→ student_id INTEGER, course_id INTEGER, registration_date DATETIME
→ reward: +0.015
Step 2: DESCRIBE students
→ student_id INTEGER, student_details VARCHAR(255)
→ reward: +0.015
Step 3: QUERY
SELECT T1.student_id, count(*)
FROM students AS T1
JOIN student_course_registrations AS T2
ON T1.student_id = T2.student_id
GROUP BY T1.student_id
→ 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
→ reward: +0.150
Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
→ correct
→ reward: +1.000
```
Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.
### Baseline comparison (50 episodes each)
| Policy | Success Rate | Avg Reward | Avg Steps |
|--------|-------------|------------|-----------|
| Random | 0.0% | 0.247 | 15.0 |
| Oracle | 100.0% | 1.168 | 3.5 |
The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.
### Reward constants (from server/reward.py)
```
+0.02 successful execution (no errors)
+0.01 new information (unique query)
-0.02 step cost (every action)
-0.03 repeat penalty (duplicate SQL)
[-0.10, +0.15] per-step clipping range
+1.0 correct answer (terminal)
+0.0 wrong answer (terminal)
```
Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.
## Competition Context
### OpenEnv Challenge (our target)
- Sponsors: PyTorch/Meta, HuggingFace, Unsloth
- Prize: $10K HF credits
- Judging: primarily blog-based
- Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper (the wrapper is an explicit judging criterion)
### Deliverables
1. Environment on HF Hub — **live** at https://huggingface.co/spaces/hjerpe/sql_env (pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`)
2. Training notebooks/scripts on GitHub — `notebooks/train_grpo.ipynb`, `notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb`
3. Blog on HuggingFace — `docs/blog-post-v1.md` (draft)
### TRL integration status (already done — do not re-research)
`training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory` class: `reset()` + named tool methods `describe() / sample() / query() / answer()` with docstrings TRL uses to build the tool schema. The notebook passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL, reward_funcs=[sql_env_reward_func])`. The adapter runs `SQLEnvironment` **in-process** (not a WebSocket client to the HF Space) — intentional, because training opens N parallel sessions and the Space defaults to 1.
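A condensed skeleton of what that adapter shape implies (method bodies, the `SQLEnvironment` call signatures, and the import path are assumptions; the real code is in `training/trl_adapter.py`):
```python
# Skeleton only; the SQLEnvironment method names and import path are assumptions.
from sql_env.server.environment import SQLEnvironment  # assumed import path

class SQLEnvTRL:
    """TRL-native environment_factory: reset() plus named tool methods whose
    docstrings become the tool schema the model sees."""

    def __init__(self):
        self._env = SQLEnvironment()   # in-process, not a WebSocket client to the Space

    def reset(self) -> str:
        return self._env.reset()       # question (+ table hint) for the first user turn

    def describe(self, table: str) -> str:
        """List the columns and types of a table."""
        return self._env.describe(table)

    def sample(self, table: str) -> str:
        """Return a few example rows from a table."""
        return self._env.sample(table)

    def query(self, sql: str) -> str:
        """Execute a read-only SQL query and return the result or the error."""
        return self._env.query(sql)

    def answer(self, final_answer: str) -> str:
        """Submit the final answer and end the episode."""
        return self._env.answer(final_answer)
```
The notebook passes the class itself (not an instance) as `environment_factory`, as noted above, so TRL can open one environment per rollout.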
### Competitive landscape
- **SQL Repair** (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
- **Calendar Gym** (Turing) — featured on HF blog, real-world framing + failure analysis
- **OpenSec** — cybersecurity with arXiv paper, adversarial evidence injection
- Our position: no interactive SQL *exploration* environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.
### What winning entries do
1. Stakes framing — "this matters in production"
2. Concrete failure analysis with numbers
3. Contrast (random vs trained vs oracle)
4. Real data, not toy puzzles
5. Non-obvious insights from training
## Green Agent Evaluator
### What it is
OpenEnv's standardized evaluation wrapper pattern: a `Policy` protocol plus an `evaluate(env, policy, n_episodes, seed)` harness that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.
### Implementation
- `evaluation/policies.py` — `Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
- `evaluation/oracle_policy.py` — `OraclePolicy` baseline (runs gold SQL)
- `tests/test_evaluation.py` — 17 tests, all passing (unit + integration)
### How it works
```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy
# Run 50 episodes with random policy
result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
# Run with trained policy (any class with select_action method)
result = evaluate(env, trained_policy, n_episodes=50, seed=42)
```
### Where it's used
- `notebooks/showcase_sqlenv.ipynb` — Random vs Oracle baseline comparison
- `notebooks/compare_methods.ipynb` — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`
### Key design choices
- **Error isolation**: one episode crashing doesn't kill the run — it is logged as `EpisodeResult(error=str(exc))`
- **Deterministic seeding**: `seed + episode_index` per episode for reproducibility
- **Protocol-based**: any class with `select_action(observation) -> action` works — no inheritance required (see the sketch below)
- **Aggregate + per-episode**: `EvaluationResult` has both summary metrics and the full `episodes` list for drill-down
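A sketch of what the protocol-based design means in practice (illustrative; the real `Policy` and `evaluate()` live in `evaluation/policies.py`):
```python
# Illustrative sketch of the structural-typing contract; not the real definitions.
from typing import Any, Protocol

class Policy(Protocol):
    def select_action(self, observation: Any) -> Any: ...

class AlwaysDescribePolicy:
    """No inheritance needed: having select_action() is enough to satisfy Policy."""
    def select_action(self, observation: Any) -> Any:
        return {"name": "describe", "arguments": {"table": "employee"}}

# Usable with the same harness as RandomPolicy / OraclePolicy:
# result = evaluate(env, AlwaysDescribePolicy(), n_episodes=10, seed=0)
```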
### For the blog
The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through `evaluate()`. The trained GRPO model is wrapped in `LLMToolCallingPolicy` (which implements the `Policy` protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.
## Files to Reference
| File | Relevance |
|------|-----------|
| `docs/blog-outline.md` | Section structure template |
| `docs/blog-post.md` | Current draft |
| `docs/design-docs/reward-shaping-research.md` | Reward theory + references |
| `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed |
| `vision/VISION.md` | Product vision, success metrics |
| `training/trl_adapter.py` | Environment adapter code |
| `notebooks/compare_methods.ipynb` | Eval notebook |
| `notebooks/train_grpo.ipynb` | Training notebook |