# Blog Material: Raw Knowledge Dump
|
|
| Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at `docs/blog-outline.md`, the draft at `docs/blog-post.md`. |
|
|
| ## Training Run Summary |
|
|
| ### Run progression (what each run taught us) |
1. **Run 1**: SFT works but GRPO plateaus; nothing penalized post-episode waste
2. **Run 2**: Qwen3 tokenizer expands dict args to null params, the root cause of the first collapse
3. **Run 3**: Without a KL penalty, GRPO drifts structural tokens (`<tool_response>` instead of `<tool_call>`)
4. **Run 4**: KL penalty + reference model = OOM on the L4
5. **Run 5**: KL too conservative with single-turn SFT; the model only calls describe, never queries
6. **Run 6**: Multi-turn SFT breakthrough; first successful training, reward -0.1 → 0.7
7. **Run 7**: Repeat penalty added; stable training, multi-table weakness exposed
8. **Run 8**: Thinking mode helps error recovery but introduces the `<think>assistant` degenerate loop; OOM crash
9. **Run 9**: v2 continued training confirms the ceiling; more epochs don't help medium questions
|
|
| ### Key numbers |
| | Metric | Value | |
| |--------|-------| |
| | Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) | |
| | Hardware | Colab L4 (24GB VRAM) | |
| | SFT data | 120 multi-turn trajectories, assistant_only_loss | |
| | Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h | |
| | Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h | |
| | Total training | ~7.3h on L4 for both phases | |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| | Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) | |
| | Parse rate | >98% (valid tool-call JSON) | |
| | Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) | |
| | Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) | |
| | Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) | |
|
|
| ## Concrete Episode Examples (for blog) |
|
|
| ### Success: Simple aggregation (3 tool calls) |
| Question: "What is the total bonus given in all evaluations?" |
1. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
2. `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
3. `answer("19500.0")` → correct
| Reward: 1.15 |
|
|
| ### Success: Error recovery with JOIN |
| Question: "Which employee received the biggest bonus?" |
1. `describe("employee")` → Employee_ID, Name, Age, City
2. `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
3. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
4. `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
5. `answer("Louis Deacon")` → correct
| Reward: 1.13 |
| |
| ### Success: Subquery with comparison |
| Question: "How many cars have greater accelerate than the one with the most horsepower?" |
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
2. `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
3. `answer("39")` → correct
| Reward: 1.15 |
|
|
| ### Failure: Column name hallucination |
| Question: "What are all the makers and models?" |
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
2. `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
3. `describe("model_list")` → ModelId, Maker, Model
4. `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
5. `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)
| Model found the right data but answer was truncated by the 20-row display limit. |
|
|
| ### Failure: FK chain reasoning |
| Question: "Which city has most number of arriving flights?" |
1. `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
2. `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
3. `answer("Albany")` → incorrect
Model counted airports per city, not flights per city. It needed to JOIN flights → airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.
|
|
| ### Failure: Invented table |
| Question: "What are all details of students who registered but did not attend any course?" |
1. `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
2. `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
3. `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance
| Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget. |
| |
| ## Reward Architecture Details |
| |
| ### Three-layer structure |
| ``` |
| L1 Operational (every step): |
| +0.02 exec_ok |
| +0.01 new_info (unique SQL hash) |
| -0.03 repeat penalty |
| -0.02 step cost |
| |
| L2 Progress (QUERY only): |
Delta from previous binned progress × 0.15
| Binned to {0, 0.25, 0.5, 0.75, 1.0} |
| |
| L3 Terminal (ANSWER only): |
| +1.0 correct, 0.0 wrong |
| |
| Per-step clip: [-0.10, 0.15] |
| ``` |
| |
| ### Why potential-based shaping matters |
- Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves the optimal policy
- Our delta progress IS potential-based with γ=1
- Cumulative caps are NOT potential-based (they depend on trajectory history; see the sketch below)
- Without this guarantee, agents learn to farm exploration rewards
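The invariance is easy to see numerically: delta-of-potential rewards telescope, so the shaped total depends only on the endpoints. A short illustration with hypothetical potential values:

```python
# Potential-based shaping telescopes: total shaped reward = Phi(end) - Phi(start),
# regardless of the path taken (gamma = 1). Values are illustrative.
def shaped_sum(potentials):
    """Sum of F(s, s') = Phi(s') - Phi(s) along a trajectory."""
    return sum(b - a for a, b in zip(potentials, potentials[1:]))

steady    = [0.0, 0.25, 0.5, 1.0]        # monotone progress
backtrack = [0.0, 0.5, 0.25, 0.5, 1.0]   # wanders, then recovers
assert shaped_sum(steady) == shaped_sum(backtrack) == 1.0
# A cumulative cap ("at most +0.3 exploration bonus per episode") makes a step's
# reward depend on how much bonus was already collected (history, not state),
# which is exactly what voids the policy-invariance guarantee.
```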
| |
| ### Anti-farming mechanisms |
| - Hard budget (15 steps) |
| - Step cost (-0.02) |
| - Repeat penalty (-0.03) |
| - Terminal dominance (1.0 vs ~0.3 max exploration) |
| - Per-step clip [-0.10, 0.15] |
| - Post-episode penalty (-0.3) |
| |
| ## Eval Results (N=50, 2026-04-11) |
| |
| ### Comparison table (for blog, N=50 with retry, 2026-04-11, Run B) |
| | Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust | |
| |--------|----------|------------|-----------|------------|-------------|----------------| |
| | zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 | |
| | 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 | |
| | 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 | |
| | GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 | |
| | GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 | |
| |
| ### Previous run (Run A, same day, same seed) |
| | Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust | |
| |--------|----------|------------|-----------|------------|----------------| |
| | zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 | |
| | 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 | |
| | 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 | |
| | GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 | |
| | GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 | |
| |
| ### Run-to-run variation (important for blog) |
| v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is **within run-to-run variation** (~6-8pp swing). For the blog, report both as "~28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2. |
| |
The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern: on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.
| |
Note: parse failures no longer end episodes; the model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).
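In harness terms the fallback looks roughly like this; `generate`, `parse_tool_call`, and the action dict shape are assumptions, not the actual eval code:

```python
def run_episode(env, generate, parse_tool_call, max_steps=15):
    """Parse failures don't end the episode: substitute a no-op DESCRIBE."""
    obs, done, steps, parse_fails = env.reset(), False, 0, 0
    while not done and steps < max_steps:
        raw = generate(obs)
        action = parse_tool_call(raw)   # None when output isn't valid tool-call JSON
        if action is None:
            parse_fails += 1
            action = {"name": "describe", "arguments": {}}   # no-op, burns a step
        obs, reward, done = env.step(action)
        steps += 1
    return steps, parse_fails   # base models: ~11-15 steps, many parse fails
```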
| |
| ### Key observations from N=50 eval (with retry, 2 runs) |
| 1. **~30% accuracy** for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs). |
2. **Run-to-run variation is ~6-8pp**: v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
3. **Base model parse failure loop**: without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
4. **GRPO solves format compliance**: 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid `<tool_call>` JSON.
5. **GRPO failure mode is SQL quality, not format**: episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
6. **Extra turns don't help base models**: more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
7. **1-shot occasionally gets lucky**: it scored 2% in Run B (1/50 correct) and 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.
| |
### v2 vs v1: similar accuracy, but more parse failures reveal a behavioral shift
| Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training: |
| |
| - **v1 guesses more**: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate). |
- **v2 gives up on hard questions**: after multiple failed queries, v2 emits "Task complete." instead of calling `answer()`, which the harness counts as a parse failure. v2 learned that some questions are unsolvable.
| - **Neither is clearly better**: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference. |
| |
| The v2 parse failure pattern (from raw output): |
| ``` |
| [OK] DESCRIBE: country |
| [OK] QUERY: SELECT Name FROM country WHERE Population < ... |
[PARSE FAIL] raw: Task complete. ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete. ← repeats until budget
| ``` |
| |
| Compare v1 on the same type of question: |
| ``` |
| [OK] DESCRIBE: country |
| [OK] QUERY: SELECT Name FROM country WHERE ... |
[OK] ANSWER: European cities and their names are: 42 ← wrong, but at least calls answer()
| ``` |
| |
This is a form of **calibrated uncertainty**: v2 is better at knowing what it doesn't know. The incorrect-answer reward of 0.0 (see learning #19 in the session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in the correct format) might recover these episodes.
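As a sketch, the proposed variant would change only the terminal layer; this is hypothetical, not what `server/reward.py` does today:

```python
# Hypothetical terminal reward: a well-formatted wrong answer earns a small
# bonus, so attempting an answer beats abstaining with "Task complete."
def terminal_reward(correct: bool, valid_answer_call: bool) -> float:
    if correct:
        return 1.0
    if valid_answer_call:
        return 0.1   # proposed change; currently wrong answers earn 0.0
    return 0.0       # no answer() call at all
```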
| |
| ### For the blog narrative |
The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe → query → answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%"; the variation between runs is larger than the difference between checkpoints.
| |
| ## Format Mismatch Discovery (F011) |
| |
| ### The three differences between eval and training |
1. **role:tool vs role:user**: Qwen3 renders `role:"tool"` as `<|im_start|>user\n<tool_response>...</tool_response>`, but `role:"user"` as `<|im_start|>user\nplain text`. Same role token, different content structure.
2. **Structured tool_calls vs raw text**: Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`; eval was using `{"role":"assistant", "content":"<tool_call>...</tool_call>"}`.
3. **No separator vs `\n\n`**: TRL appends the `reset()` return directly to the user message. Eval had `question\n\ntable_hint`.
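To make difference (2) concrete, here are the two assistant-message shapes written out as dicts; the `"table": "employee"` argument is illustrative:

```python
# Training (TRL) used the structured form:
training_msg = {
    "role": "assistant",
    "tool_calls": [
        {"function": {"name": "describe", "arguments": '{"table": "employee"}'}}
    ],
}
# The old eval harness used raw text instead:
old_eval_msg = {
    "role": "assistant",
    "content": '<tool_call>{"name": "describe", "arguments": {"table": "employee"}}</tool_call>',
}
# Qwen3's chat template renders these to different token sequences, so eval
# showed the checkpoint a format it never saw during training.
```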
|
|
| ### Impact |
| Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint). |
| After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%. |
|
|
| ### Lesson |
| Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing. |
|
|
## Multi-Turn SFT: Why It's Critical
|
|
| ### Per-turn SFT (broken) |
| - 347 examples, each one assistant turn |
| - ~50% were describe calls |
| - Model learned: "when asked a question, call describe" |
| - With KL penalty, model stayed anchored to this single-action policy |
| - Result: reward=0.00, all rollouts identical, advantage=0 |
|
|
| ### Multi-turn SFT (working) |
| - 120 examples, each a full describeβqueryβanswer trajectory |
| - `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags) |
- Model learned: the SEQUENCE describe → query → answer
| - With KL penalty, model explores within the multi-turn strategy |
| - Result: reward climbs to 0.7 in Phase 1 |
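For contrast with the per-turn format, one multi-turn example looks roughly like this (reusing the bonus question from the episode examples above; column types and argument names are illustrative):

```python
# One full describe -> query -> answer trajectory; with assistant_only_loss,
# loss is computed only on the assistant turns.
trajectory = [
    {"role": "user", "content": "What is the total bonus given in all evaluations?"},
    {"role": "assistant", "tool_calls": [{"function": {
        "name": "describe", "arguments": '{"table": "evaluation"}'}}]},
    {"role": "tool", "content": "Employee_ID INTEGER, Year_awarded TEXT, Bonus REAL"},
    {"role": "assistant", "tool_calls": [{"function": {
        "name": "query", "arguments": '{"sql": "SELECT SUM(Bonus) FROM evaluation"}'}}]},
    {"role": "tool", "content": "19500.0"},
    {"role": "assistant", "tool_calls": [{"function": {
        "name": "answer", "arguments": '{"answer": "19500.0"}'}}]},
]
```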
|
|
| ### Template patch detail |
| Qwen3's chat template lacks `{% generation %}` tags needed by TRL for assistant_only_loss. We patch the template before SFT, restore original before GRPO (TRL does exact-match checks on template string in `add_response_schema()` and `get_training_chat_template()`). |
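The dance looks roughly like this; the replace target is illustrative, since the real patch depends on the exact text of Qwen3's template:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
ORIGINAL = tok.chat_template   # keep a byte-identical copy for restoring

def add_generation_tags(template: str) -> str:
    """Wrap assistant content so TRL's assistant_only_loss can mask it."""
    return template.replace(
        "{{ message.content }}",
        "{% generation %}{{ message.content }}{% endgeneration %}",
    )

tok.chat_template = add_generation_tags(ORIGINAL)   # before SFT
# ... run SFT ...
tok.chat_template = ORIGINAL   # restore before GRPO: TRL exact-matches the
                               # template string, so bytes must be identical
```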
|
|
| ## The 0.6B Capacity Ceiling |
|
|
| ### What works at 0.6B |
| - Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT |
| - Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID) |
| - WHERE with LIKE, IN, BETWEEN, NOT IN subqueries |
| - Answer formatting: comma lists, pipe-delimited rows, `[]` for empty |
| - Error recovery: describe after SQL error, retry with correct column names |
| - `sample` tool usage (learned in Run 6, inconsistent later) |
|
|
| ### What breaks at 0.6B |
- FK chain reasoning: 3+ table joins (Documents → Templates → Ref_Template_Types)
| - Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL |
| - Ambiguous column resolution: joins with same column name in both tables |
| - Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING |
| - "stadium without concert" pattern: NOT IN with JOIN to get names |
| - Aggregate + GROUP BY + HAVING chains on multi-table joins |
|
|
| ### The hallucination pattern |
The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem: the schema is in the context window. It's a weight problem: pretraining biases override in-context information at 0.6B scale.
|
|
| ## Thinking Mode Observations (Run 8) |
|
|
| ### Benefits |
- Reasons through SQL errors: "no such column: airport_code" → `<think>` block → tries `AirportCode`
- Empty `<think></think>` on easy questions: token-efficient, emergent behavior
| - Multi-step join planning in think blocks |
| |
| ### New failure mode |
| ~23% of rollouts: `<think>assistant<think>assistant...` repeating until token limit. Model fails to close `</think>` tag. Burns entire token budget with garbage. |
| |
| ### OOM risk |
Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.
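As config, the two knobs map to TRL `GRPOConfig` fields (assuming the notebook's `max_new_tokens` corresponds to `max_completion_length`):

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="checkpoints",     # placeholder
    max_completion_length=1024,   # was 1280; caps how long <think> blocks can grow
    num_generations=3,            # was 4; fewer concurrent rollouts at peak memory
)
```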
|
|
| ## Narrative Threads for Blog |
|
|
| ### "The environment is the product" |
From VISION.md: "SQLEnv is a reinforcement learning environment, not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.
|
|
| ### "Small model showing improvement proves more than large model with marginal gains" |
A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe → query → answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.
|
|
| ### "Analysts don't write perfect queries from scratch" |
| The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process. |
|
|
| ### "Dense rewards need theory" |
Potential-based shaping isn't just good practice: it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.
|
|
| ### "Multi-turn SFT teaches strategy, not actions" |
| The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation. |
|
|
| ## References for Blog |
|
|
| - Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML. |
| - DeepSeek-AI (2025). DeepSeek-R1. |
| - Shao et al. (2024). DeepSeek-Math: GRPO. |
| - Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026. |
| - Yu et al. (2018). Spider dataset. |
| - Li et al. (2023). BIRD benchmark. |
| - TIPS (2026). Turn-Level Information-Potential Reward Shaping. |
| - ToolRL (2025). Reward is All Tool Learning Needs. |
| - StepTool (2024). Step-grained RL for Tool Learning. |
|
|
| ## Showcase Notebook Transcripts (for blog) |
|
|
### Random agent episode (seed=7): comedic failure
| Question: "Count the number of paragraphs." |
| ``` |
SAMPLE Paragraphs → reward=0.015
SAMPLE Documents → reward=0.015
DESCRIBE Documents → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Templates → reward=0.015
SAMPLE Documents → reward=0.015 (repeat)
DESCRIBE Documents → reward=0.015 (repeat)
QUERY SELECT * FROM "Templates" LIMIT 5 → reward=0.0625
DESCRIBE Documents → reward=0.015 (repeat)
DESCRIBE Paragraphs → reward=0.015
QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
QUERY SELECT * FROM "Documents" LIMIT 5 → reward=0.025
ANSWER 76 | 20 | Robbin CV | y | None → reward=0.000 (incorrect)
| ``` |
| Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote `SELECT COUNT(*)`. |
|
|
### Oracle agent episode (seed=0): clean solve
| Question: "List the id of students who registered some courses and the number of their registered courses?" |
| ``` |
Step 1: DESCRIBE student_course_registrations
→ student_id INTEGER, course_id INTEGER, registration_date DATETIME
→ reward: +0.015

Step 2: DESCRIBE students
→ student_id INTEGER, student_details VARCHAR(255)
→ reward: +0.015

Step 3: QUERY
SELECT T1.student_id, count(*)
FROM students AS T1
JOIN student_course_registrations AS T2
ON T1.student_id = T2.student_id
GROUP BY T1.student_id
→ 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
→ reward: +0.150

Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
→ correct
→ reward: +1.000
| ``` |
| Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000. |
|
|
| ### Baseline comparison (50 episodes each) |
| | Policy | Success Rate | Avg Reward | Avg Steps | |
| |--------|-------------|------------|-----------| |
| | Random | 0.0% | 0.247 | 15.0 | |
| | Oracle | 100.0% | 1.168 | 3.5 | |
|
|
| The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between. |
|
|
| ### Reward constants (from server/reward.py) |
| ``` |
| +0.02 successful execution (no errors) |
| +0.01 new information (unique query) |
| -0.02 step cost (every action) |
| -0.03 repeat penalty (duplicate SQL) |
| [-0.10, +0.15] per-step clipping range |
| +1.0 correct answer (terminal) |
| +0.0 wrong answer (terminal) |
| ``` |
| Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0. |
|
|
| ## Competition Context |
|
|
| ### OpenEnv Challenge (our target) |
| - Sponsors: PyTorch/Meta, HuggingFace, Unsloth |
| - Prize: $10K HF credits |
| - Judging: primarily blog-based |
- Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper (an explicit judging criterion)
|
|
| ### Deliverables |
1. Environment on HF Hub: **live** at https://huggingface.co/spaces/hjerpe/sql_env
| (pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`) |
| 2. Training notebooks/scripts on GitHub β `notebooks/train_grpo.ipynb`, |
| `notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb` |
| 3. Blog on HuggingFace β `docs/blog-post-v1.md` (draft) |
|
|
### TRL integration status (already done; do not re-research)
| `training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory` |
| class: `reset()` + named tool methods `describe() / sample() / query() / |
| answer()` with docstrings TRL uses to build the tool schema. The notebook |
| passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL, |
reward_funcs=[sql_env_reward_func])`. The adapter runs `SQLEnvironment`
**in-process** (not as a WebSocket client to the HF Space); this is intentional,
because training opens N parallel sessions and the Space defaults to 1.
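Abridged, the adapter's shape is roughly this (the import path and method bodies are assumptions; see `training/trl_adapter.py` for the real thing):

```python
from sql_env import SQLEnvironment  # assumed import path

class SQLEnvTRL:
    """TRL-native environment: the docstrings below become the tool schema."""

    def __init__(self):
        self.env = SQLEnvironment()   # in-process, no WebSocket round-trips

    def reset(self) -> str:
        """Start an episode; the return value becomes the first user message."""
        ...

    def describe(self, table: str) -> str:
        """Return the column names and types of a table."""
        ...

    def sample(self, table: str) -> str:
        """Return a few example rows from a table."""
        ...

    def query(self, sql: str) -> str:
        """Run a read-only SQL query and return the result rows."""
        ...

    def answer(self, text: str) -> str:
        """Submit the final answer and end the episode."""
        ...
```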
| |
| ### Competitive landscape |
- **SQL Repair** (WALKMAN303): buggy SQL fix, simpler than our multi-turn exploration
- **Calendar Gym** (Turing): featured on the HF blog, real-world framing + failure analysis
- **OpenSec**: cybersecurity with an arXiv paper, adversarial evidence injection
- Our position: no interactive SQL *exploration* environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.
| |
| ### What winning entries do |
1. Stakes framing: "this matters in production"
| 2. Concrete failure analysis with numbers |
| 3. Contrast (random vs trained vs oracle) |
| 4. Real data, not toy puzzles |
| 5. Non-obvious insights from training |
| |
| ## Green Agent Evaluator |
| |
| ### What it is |
| OpenEnv's standardized evaluation wrapper pattern. A `Policy` protocol with `evaluate(env, policy, n_episodes, seed)` that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge. |
|
|
| ### Implementation |
- `evaluation/policies.py`: `Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
- `evaluation/oracle_policy.py`: `OraclePolicy` baseline (runs gold SQL)
- `tests/test_evaluation.py`: 17 tests, all passing (unit + integration)
|
|
| ### How it works |
| ```python |
| from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy |
| |
| # Run 50 episodes with random policy |
| result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0) |
| print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}") |
| |
| # Run with trained policy (any class with select_action method) |
| result = evaluate(env, trained_policy, n_episodes=50, seed=42) |
| ``` |
|
|
| ### Where it's used |
- `notebooks/showcase_sqlenv.ipynb`: Random vs Oracle baseline comparison
- `notebooks/compare_methods.ipynb`: all 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`
|
|
| ### Key design choices |
- **Error isolation**: one episode crashing doesn't kill the run; it's logged as `EpisodeResult(error=str(exc))`
- **Deterministic seeding**: `seed + episode_index` per episode for reproducibility
- **Protocol-based**: any class with `select_action(observation) -> action` works; no inheritance required
- **Aggregate + per-episode**: `EvaluationResult` has both summary metrics and the full `episodes` list for drill-down
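Condensed, the design reads like this; a sketch of the semantics, not the actual `evaluation/policies.py` (the `env.reset(seed=...)` signature is an assumption):

```python
from typing import Any, Protocol

class Policy(Protocol):
    def select_action(self, observation: Any) -> Any:
        ...   # structural typing: any matching class qualifies, no inheritance

def evaluate(env: Any, policy: Policy, n_episodes: int, seed: int) -> list:
    results = []
    for i in range(n_episodes):
        try:
            obs = env.reset(seed=seed + i)   # deterministic per-episode seeding
            ...                              # roll the episode out to completion
        except Exception as exc:
            results.append({"error": str(exc)})   # error isolation: run continues
    return results
```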
|
|
| ### For the blog |
| The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through `evaluate()`. The trained GRPO model is wrapped in `LLMToolCallingPolicy` (which implements the `Policy` protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for. |
|
|
| ## Files to Reference |
|
|
| | File | Relevance | |
| |------|-----------| |
| | `docs/blog-outline.md` | Section structure template | |
| | `docs/blog-post.md` | Current draft | |
| | `docs/design-docs/reward-shaping-research.md` | Reward theory + references | |
| | `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed | |
| | `vision/VISION.md` | Product vision, success metrics | |
| | `training/trl_adapter.py` | Environment adapter code | |
| | `notebooks/compare_methods.ipynb` | Eval notebook | |
| | `notebooks/train_grpo.ipynb` | Training notebook | |
|
|