Spaces:

hjerpe
/

sql_env

Sleeping

File size: 24,257 Bytes

9e64e71

# Blog Material — Raw Knowledge Dump

Reference file for writing the SQLEnv blog post. Contains observations, training data, failure modes, and narrative threads extracted from 9 training runs. The blog outline is at `docs/blog-outline.md`, the draft at `docs/blog-post.md`.

## Training Run Summary

### Run progression (what each run taught us)
1. **Run 1**: SFT works, GRPO plateaus — no penalty for post-episode waste
2. **Run 2**: Qwen3 tokenizer expands dict args to null params — root cause of first collapse
3. **Run 3**: Without KL penalty, GRPO drifts structural tokens (`<tool_response>` instead of `<tool_call>`)
4. **Run 4**: KL penalty + reference model = OOM on L4
5. **Run 5**: KL too conservative with single-turn SFT — model only calls describe, never queries
6. **Run 6**: Multi-turn SFT breakthrough — first successful training, reward -0.1→0.7
7. **Run 7**: Repeat penalty, stable training, multi-table weakness exposed
8. **Run 8**: Thinking mode helps error recovery, introduces `<think>assistant` degenerate loop, OOM crash
9. **Run 9**: v2 continued training confirms ceiling — more epochs don't help medium questions

### Key numbers
| Metric | Value |
|--------|-------|
| Model | Qwen3-0.6B (target <0.5B per VISION.md, using 0.6B) |
| Hardware | Colab L4 (24GB VRAM) |
| SFT data | 120 multi-turn trajectories, assistant_only_loss |
| Phase 1 | 435 easy questions, beta=0.04, 512 tokens, ~3.5h |
| Phase 2 | 467 easy+medium, beta=0.0, 1024 tokens, ~3.8h |
| Total training | ~7.3h on L4 for both phases |
| Phase 1 reward | -0.1 → 0.5-0.7 (clear upward trend) |
| Phase 2 reward | oscillates 0.01-1.15, mean ~0.5 (no improvement over Phase 1) |
| Parse rate | >98% (valid tool-call JSON) |
| Eval accuracy (GRPO v1, N=50) | 28-30% across runs (vs 0% base model) |
| Eval accuracy (GRPO v2, N=50) | 24-32% across runs (within noise of v1) |
| Eval accuracy (base, 3-shot) | 0% (format compliance only 19-20%) |

## Concrete Episode Examples (for blog)

### Success: Simple aggregation (3 tool calls)
Question: "What is the total bonus given in all evaluations?"
1. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus columns
2. `query("SELECT SUM(Bonus) FROM evaluation")` → 19500.0
3. `answer("19500.0")` → correct
Reward: 1.15

### Success: Error recovery with JOIN
Question: "Which employee received the biggest bonus?"
1. `describe("employee")` → Employee_ID, Name, Age, City
2. `query("SELECT Name FROM employee ORDER BY Salary DESC LIMIT 1")` → Error: no such column: Salary
3. `describe("evaluation")` → Employee_ID, Year_awarded, Bonus
4. `query("SELECT Name FROM evaluation JOIN employee ON evaluation.Employee_ID = employee.Employee_ID ORDER BY Bonus DESC LIMIT 1")` → Louis Deacon
5. `answer("Louis Deacon")` → correct
Reward: 1.13

### Success: Subquery with comparison
Question: "How many cars have greater accelerate than the one with the most horsepower?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year
2. `query("SELECT COUNT(*) FROM cars_data WHERE Accelerate > (SELECT Accelerate FROM cars_data ORDER BY Horsepower DESC LIMIT 1)")` → 39
3. `answer("39")` → correct
Reward: 1.15

### Failure: Column name hallucination
Question: "What are all the makers and models?"
1. `describe("cars_data")` → Id, MPG, Cylinders, Edispl, Horsepower, Weight, Accelerate, Year (no "model" column)
2. `query("SELECT Id, model FROM cars_data")` → Error: no such column: model
3. `describe("model_list")` → ModelId, Maker, Model
4. `query("SELECT Maker, Model FROM model_list")` → correct results (truncated to 20 rows)
5. `answer("1 | amc\n2 | audi\n3 | bmw...")` → incorrect (truncated, missing rows)
Model found the right data but answer was truncated by the 20-row display limit.

### Failure: FK chain reasoning
Question: "Which city has most number of arriving flights?"
1. `describe("AIRPORTS")` → City, AirportCode, AirportName, Country, CountryAbbrev
2. `query("SELECT CITY, COUNT(*) FROM AIRPORTS GROUP BY CITY ORDER BY COUNT(*) DESC LIMIT 1")` → Albany | 4
3. `answer("Albany")` → incorrect
Model counted airports per city, not flights per city. Needed to JOIN flights→airports on DestAirport=AirportCode, then GROUP BY City. The FK chain reasoning is missing.

### Failure: Invented table
Question: "What are all details of students who registered but did not attend any course?"
1. `describe("Student_Course_Registrations")` → student_id, course_id, registration_date
2. `query("SELECT ... FROM Student_Course_Registrations WHERE course_id NOT INSELECT ...")` → syntax error (missing space)
3. `describe("Course_Attendance")` → Error: table not found. Available: Student_Course_Attendance
Model invented "Course_Attendance" instead of reading the available table list. Cascading errors consumed the full step budget.

## Reward Architecture Details

### Three-layer structure
```
L1 Operational (every step):
  +0.02  exec_ok
  +0.01  new_info (unique SQL hash)
  -0.03  repeat penalty
  -0.02  step cost

L2 Progress (QUERY only):
  Delta from previous binned progress × 0.15
  Binned to {0, 0.25, 0.5, 0.75, 1.0}

L3 Terminal (ANSWER only):
  +1.0 correct, 0.0 wrong

Per-step clip: [-0.10, 0.15]
```

### Why potential-based shaping matters
- Ng et al. (1999): F(s,s') = Φ(s') - Φ(s) preserves optimal policy
- Our delta progress IS potential-based with γ=1
- Cumulative caps are NOT potential-based (depend on trajectory history)
- Without this guarantee, agents learn to farm exploration rewards

### Anti-farming mechanisms
- Hard budget (15 steps)
- Step cost (-0.02)
- Repeat penalty (-0.03)
- Terminal dominance (1.0 vs ~0.3 max exploration)
- Per-step clip [-0.10, 0.15]
- Post-episode penalty (-0.3)

## Eval Results (N=50, 2026-04-11)

### Comparison table (for blog, N=50 with retry, 2026-04-11, Run B)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Parse Fails | Budget Exhaust |
|--------|----------|------------|-----------|------------|-------------|----------------|
| zero-shot | 0.0% | 0.007 | 12.4 | 23.6% | 434 | 38 |
| 1-shot | 2.0% | 0.061 | 14.0 | 17.0% | 537 | 46 |
| 3-shot | 0.0% | 0.057 | 14.8 | 19.0% | 551 | 49 |
| GRPO v1 | 30.0% | 0.386 | 3.5 | 100.0% | 0 | 0 |
| GRPO v2 | 24.0% | 0.321 | 3.6 | 95.1% | 8 | 1 |

### Previous run (Run A, same day, same seed)
| Method | Accuracy | Avg Reward | Avg Steps | Parse Rate | Budget Exhaust |
|--------|----------|------------|-----------|------------|----------------|
| zero-shot | 0.0% | 0.016 | 10.8 | 28.1% | 31/50 |
| 1-shot | 0.0% | 0.031 | 14.8 | 15.6% | 49/50 |
| 3-shot | 0.0% | 0.041 | 13.8 | 20.3% | 44/50 |
| GRPO v1 | 28.0% | 0.355 | 4.0 | 95.0% | 2/50 |
| GRPO v2 | 32.0% | 0.400 | 3.7 | 87.1% | 2/50 |

### Run-to-run variation (important for blog)
v1 and v2 show similar accuracy with noise at N=50: v1 scored 28% then 30%, v2 scored 32% then 24%. The difference between checkpoints is **within run-to-run variation** (~6-8pp swing). For the blog, report both as "~28-32% accuracy" or "roughly 30%" rather than claiming one is better. The meaningful comparison is GRPO (~30%) vs base model (0-2%), not v1 vs v2.

The variation comes from: (1) temperature sampling during generation, (2) question selection randomness at N=50, (3) v2's "Task complete." abstention pattern — on borderline questions, whether v2 guesses or abstains varies by run, causing larger accuracy swings.

Note: parse failures no longer end episodes — model gets a no-op DESCRIBE and continues. This gives base models the same step budget as trained models, but they waste it on repeated parse failures (avg 11-15 steps vs GRPO's 3.5-4.0).

### Key observations from N=50 eval (with retry, 2 runs)
1. **~30% accuracy** for GRPO vs 0-2% for base model across all conditions. v1 and v2 are statistically indistinguishable (28-30% vs 24-32% across runs).
2. **Run-to-run variation is ~6-8pp** — v1 scored 28% then 30%, v2 scored 32% then 24%. At N=50, don't over-interpret small differences between checkpoints.
3. **Base model parse failure loop** — without episode termination on parse failure, base models burn their entire 15-step budget repeating the same non-tool-call output (e.g., "- Single value: []" 11 times). 46-49/50 1-shot episodes hit budget exhaustion.
4. **GRPO solves format compliance** — 95-100% parse rate (v1) vs 17-28% for base. The trained model almost always produces valid `<tool_call>` JSON.
5. **GRPO failure mode is SQL quality, not format** — episodes with correct tool-call format but wrong SQL/answer dominate GRPO failures.
6. **Extra turns don't help base models** — more steps just mean more repeated failures. The fundamental gap is format compliance, not exploration budget.
7. **1-shot occasionally gets lucky** — scored 2% in Run B (1/50 correct), 0% in Run A. At N=50, a single lucky episode swings accuracy by 2pp.

### v2 vs v1: similar accuracy, more parse failures — behavioral shift
Across two runs, v1 and v2 show overlapping accuracy ranges (28-30% vs 24-32%). The difference is within run-to-run variation at N=50. However, v2 consistently shows more parse failures (8-22 vs 0-8), revealing a behavioral shift from continued training:

- **v1 guesses more**: v1 almost always calls `answer()`, even when uncertain. It submits wrong answers confidently (0 parse failures in Run B, 100% parse rate).
- **v2 gives up on hard questions**: v2 produces "Task complete." output after multiple failed queries instead of calling `answer()`, producing parse failures. v2 learned that some questions are unsolvable.
- **Neither is clearly better**: v2's caution helps on some runs (32% in Run A) and hurts on others (24% in Run B). The abstention behavior adds variance. For the blog, present them as equivalent (~30%) with a qualitative note about the behavioral difference.

The v2 parse failure pattern (from raw output):
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE Population < ...
[PARSE FAIL] raw: Task complete.    ← gives up, doesn't call answer()
[PARSE FAIL] raw: Task complete.    ← repeats until budget
```

Compare v1 on the same type of question:
```
[OK] DESCRIBE: country
[OK] QUERY: SELECT Name FROM country WHERE ...
[OK] ANSWER: European cities and their names are: 42    ← wrong, but at least calls answer()
```

This is a form of **calibrated uncertainty** — v2 is better at knowing what it doesn't know. The incorrect answer reward of 0.0 (see learning #19 in session log) creates an avoid-answering incentive that v2 has partially internalized. A more generous incorrect-answer reward (e.g., +0.1 for attempting an answer in correct format) might recover these episodes.

### For the blog narrative
The story is clear: GRPO teaches format compliance (0% → 95-100% parse rate) and strategic tool use (describe→query→answer in 3-4 steps). Base models waste 15 steps repeating parse failures. The ~30% accuracy ceiling (consistent across checkpoints and runs) comes from the 0.6B model's SQL reasoning capacity, not from the environment or training pipeline. The environment scales; the model doesn't. Report v1 and v2 as "roughly 30%" — the variation between runs is larger than the difference between checkpoints.

## Format Mismatch Discovery (F011)

### The three differences between eval and training
1. **role:tool vs role:user** — Qwen3 renders `role:"tool"` as `<|im_start|>user\n<tool_response>...</tool_response>`, `role:"user"` as `<|im_start|>user\nplain text`. Same role token, different content structure.
2. **Structured tool_calls vs raw text** — Training uses `{"role":"assistant", "tool_calls":[{"function":{"name":"describe","arguments":"{...}"}}]}`, eval was using `{"role":"assistant", "content":"<tool_call>...</tool_call>"}`.
3. **No separator vs `\n\n`** — TRL appends `reset()` return directly to user message. Eval had `question\n\ntable_hint`.

### Impact
Before fix: 0% accuracy across ALL conditions (zero-shot, 1-shot, 3-shot, GRPO checkpoint).
After fix: 10% zero-shot, 30% 1-shot, 50% 3-shot on base model. GRPO checkpoint still 10%.

### Lesson
Eval format matching is not a nice-to-have. It's a prerequisite for ANY measurement. We spent time debugging model quality when the problem was plumbing.

## Multi-Turn SFT — Why It's Critical

### Per-turn SFT (broken)
- 347 examples, each one assistant turn
- ~50% were describe calls
- Model learned: "when asked a question, call describe"
- With KL penalty, model stayed anchored to this single-action policy
- Result: reward=0.00, all rollouts identical, advantage=0

### Multi-turn SFT (working)
- 120 examples, each a full describe→query→answer trajectory
- `assistant_only_loss` via Qwen3 template patch (`{% generation %}` tags)
- Model learned: the SEQUENCE describe→query→answer
- With KL penalty, model explores within the multi-turn strategy
- Result: reward climbs to 0.7 in Phase 1

### Template patch detail
Qwen3's chat template lacks `{% generation %}` tags needed by TRL for assistant_only_loss. We patch the template before SFT, restore original before GRPO (TRL does exact-match checks on template string in `add_response_schema()` and `get_training_chat_template()`).

## The 0.6B Capacity Ceiling

### What works at 0.6B
- Single-table queries: COUNT, SUM, AVG, MIN, MAX, GROUP BY, HAVING, ORDER BY, LIMIT
- Simple JOINs between 2 tables when FK is obvious (evaluation.Employee_ID = employee.Employee_ID)
- WHERE with LIKE, IN, BETWEEN, NOT IN subqueries
- Answer formatting: comma lists, pipe-delimited rows, `[]` for empty
- Error recovery: describe after SQL error, retry with correct column names
- `sample` tool usage (learned in Run 6, inconsistent later)

### What breaks at 0.6B
- FK chain reasoning: 3+ table joins (Documents→Templates→Ref_Template_Types)
- Column name fidelity: reads `FullName` from describe, writes `full_name` in SQL
- Ambiguous column resolution: joins with same column name in both tables
- Complex subqueries: INTERSECT, EXCEPT, correlated subqueries with HAVING
- "stadium without concert" pattern: NOT IN with JOIN to get names
- Aggregate + GROUP BY + HAVING chains on multi-table joins

### The hallucination pattern
The model describes a table and sees the exact column names. Then it writes SQL using pretrained column names that don't match. This isn't a memory problem — the schema is in the context window. It's a weight problem — pretraining biases override in-context information at 0.6B scale.

## Thinking Mode Observations (Run 8)

### Benefits
- Reasons through SQL errors: "no such column: airport_code" → `<think>` block → tries `AirportCode`
- Empty `<think></think>` on easy questions — token-efficient, emergent behavior
- Multi-step join planning in think blocks

### New failure mode
~23% of rollouts: `<think>assistant<think>assistant...` repeating until token limit. Model fails to close `</think>` tag. Burns entire token budget with garbage.

### OOM risk
Thinking blocks consume more tokens → higher peak memory during generation. Phase 2 crashed at step 182/467 with max_new_tokens=1280. Fix: reduce to 1024, or reduce num_generations from 4 to 3.

## Narrative Threads for Blog

### "The environment is the product"
From VISION.md: "SQLEnv is a reinforcement learning environment — not a text-to-SQL model. The environment is the product." The trained agent demonstrates that the environment works, but the contribution is the action space, reward architecture, and episode structure.

### "Small model showing improvement proves more than large model with marginal gains"
A 0.6B model going from 0% to 10% accuracy with clear strategic behavior (describe→query→answer, error recovery) proves the environment produces learning signal. A 70B model with marginal gains would prove nothing about the environment.

### "Analysts don't write perfect queries from scratch"
The hook. Frame the problem as: text-to-SQL evaluates guessing, not investigating. SQLEnv evaluates the process.

### "Dense rewards need theory"
Potential-based shaping isn't just good practice — it's the guarantee that the agent optimizes for the right objective. Without it, we saw agents farming exploration rewards.

### "Multi-turn SFT teaches strategy, not actions"
The difference between per-turn and multi-turn SFT is the difference between teaching vocabulary and teaching conversation.

## References for Blog

- Ng, Harada, Russell (1999). Policy invariance under reward transformations. ICML.
- DeepSeek-AI (2025). DeepSeek-R1.
- Shao et al. (2024). DeepSeek-Math: GRPO.
- Sullivan et al. (2025/2026). GRPO is Secretly a Process Reward Model. ICLR 2026.
- Yu et al. (2018). Spider dataset.
- Li et al. (2023). BIRD benchmark.
- TIPS (2026). Turn-Level Information-Potential Reward Shaping.
- ToolRL (2025). Reward is All Tool Learning Needs.
- StepTool (2024). Step-grained RL for Tool Learning.

## Showcase Notebook Transcripts (for blog)

### Random agent episode (seed=7) — comedic failure
Question: "Count the number of paragraphs."
```
  SAMPLE Paragraphs         → reward=0.015
  SAMPLE Documents           → reward=0.015
  DESCRIBE Documents         → reward=0.015
  SAMPLE Documents           → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Templates         → reward=0.015
  SAMPLE Documents           → reward=0.015  (repeat)
  DESCRIBE Documents         → reward=0.015  (repeat)
  QUERY SELECT * FROM "Templates" LIMIT 5  → reward=0.0625
  DESCRIBE Documents         → reward=0.015  (repeat)
  DESCRIBE Paragraphs        → reward=0.015
  QUERY SELECT * FROM "Paragraphs" LIMIT 5 → reward=0.025
  QUERY SELECT * FROM "Documents" LIMIT 5  → reward=0.025
  ANSWER 76 | 20 | Robbin CV | y | None   → reward=0.000 (incorrect)
```
Total reward: 0.278. Used all 15 steps. Described Documents 5 times. Answered with a random row from the wrong table. Never wrote `SELECT COUNT(*)`.

### Oracle agent episode (seed=0) — clean solve
Question: "List the id of students who registered some courses and the number of their registered courses?"
```
  Step 1: DESCRIBE student_course_registrations
    → student_id INTEGER, course_id INTEGER, registration_date DATETIME
    → reward: +0.015

  Step 2: DESCRIBE students
    → student_id INTEGER, student_details VARCHAR(255)
    → reward: +0.015

  Step 3: QUERY
    SELECT T1.student_id, count(*)
    FROM students AS T1
    JOIN student_course_registrations AS T2
      ON T1.student_id = T2.student_id
    GROUP BY T1.student_id
    → 111|1, 121|2, 131|1, 141|2, 151|1, 161|1, 171|1
    → reward: +0.150

  Step 4: ANSWER [[111,1],[121,2],[131,1],[141,2],[151,1],[161,1],[171,1]]
    → correct
    → reward: +1.000
```
Total reward: 1.180. 4 steps, efficient. Exploration (L1+L2): 0.180, Terminal (L3): 1.000.

### Baseline comparison (50 episodes each)
| Policy | Success Rate | Avg Reward | Avg Steps |
|--------|-------------|------------|-----------|
| Random | 0.0% | 0.247 | 15.0 |
| Oracle | 100.0% | 1.168 | 3.5 |

The gap between 0.247 and 1.168 defines the learning space. A trained agent lands somewhere between.

### Reward constants (from server/reward.py)
```
+0.02   successful execution (no errors)
+0.01   new information (unique query)
-0.02   step cost (every action)
-0.03   repeat penalty (duplicate SQL)
[-0.10, +0.15]  per-step clipping range
+1.0    correct answer (terminal)
+0.0    wrong answer (terminal)
```
Terminal dominance: max exploration over 15 steps is ~0.3 (15 * 0.02 best case), while a correct answer adds 1.0.

## Competition Context

### OpenEnv Challenge (our target)
- Sponsors: PyTorch/Meta, HuggingFace, Unsloth
- Prize: $10K HF credits
- Judging: primarily blog-based
- Criteria: Creative OpenEnv use, Technical excellence, Storytelling, Open source demo, Green Agent wrapper
- Green Agent wrapper is an explicit judging criterion in the OpenEnv Challenge.

### Deliverables
1. Environment on HF Hub — **live** at https://huggingface.co/spaces/hjerpe/sql_env
   (pushed 2026-03-29; Docker image at `registry.hf.space/hjerpe-sql_env:latest`)
2. Training notebooks/scripts on GitHub — `notebooks/train_grpo.ipynb`,
   `notebooks/compare_methods.ipynb`, `notebooks/showcase_sqlenv.ipynb`
3. Blog on HuggingFace — `docs/blog-post-v1.md` (draft)

### TRL integration status (already done — do not re-research)
`training/trl_adapter.py::SQLEnvTRL` is a TRL-native `environment_factory`
class: `reset()` + named tool methods `describe() / sample() / query() /
answer()` with docstrings TRL uses to build the tool schema. The notebook
passes it directly: `GRPOTrainer(..., environment_factory=SQLEnvTRL,
reward_funcs=[sql_env_reward_func])`. The adapter runs `SQLEnvironment`
**in-process** (not a WebSocket client to the HF Space) — intentional, because
training opens N parallel sessions and the Space defaults to 1.

### Competitive landscape
- **SQL Repair** (WALKMAN303) — buggy SQL fix, simpler than our multi-turn exploration
- **Calendar Gym** (Turing) — featured on HF blog, real-world framing + failure analysis
- **OpenSec** — cybersecurity with arXiv paper, adversarial evidence injection
- Our position: no interactive SQL *exploration* environment exists. SQL Repair is single-turn fix-it; we're multi-turn strategy discovery.

### What winning entries do
1. Stakes framing — "this matters in production"
2. Concrete failure analysis with numbers
3. Contrast (random vs trained vs oracle)
4. Real data, not toy puzzles
5. Non-obvious insights from training

## Green Agent Evaluator

### What it is
OpenEnv's standardized evaluation wrapper pattern. A `Policy` protocol with `evaluate(env, policy, n_episodes, seed)` that runs any policy through the environment and reports aggregate metrics. Listed as an explicit judging criterion in the OpenEnv Challenge.

### Implementation
- `evaluation/policies.py` — `Policy` protocol, `evaluate()` harness, `RandomPolicy`, `EpisodeResult`, `EvaluationResult`
- `evaluation/oracle_policy.py` — `OraclePolicy` baseline (runs gold SQL)
- `tests/test_evaluation.py` — 17 tests, all passing (unit + integration)

### How it works
```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

# Run 50 episodes with random policy
result = evaluate(env, RandomPolicy(seed=0), n_episodes=50, seed=0)
print(f"Success: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")

# Run with trained policy (any class with select_action method)
result = evaluate(env, trained_policy, n_episodes=50, seed=42)
```

### Where it's used
- `notebooks/showcase_sqlenv.ipynb` — Random vs Oracle baseline comparison
- `notebooks/compare_methods.ipynb` — All 5 conditions (zero-shot, 1-shot, 3-shot, GRPO v1, v2) run through `evaluate()`

### Key design choices
- **Error isolation**: one episode crashing doesn't kill the run — logged as `EpisodeResult(error=str(exc))`
- **Deterministic seeding**: `seed + episode_index` per episode for reproducibility
- **Protocol-based**: any class with `select_action(observation) -> action` works — no inheritance required
- **Aggregate + per-episode**: `EvaluationResult` has both summary metrics and full `episodes` list for drill-down

### For the blog
The Green Agent evaluator is the backbone of all evaluation. Every result in the comparison table flows through `evaluate()`. The trained GRPO model is wrapped in `LLMToolCallingPolicy` (which implements the `Policy` protocol) and evaluated identically to the Random and Oracle baselines. This is the standardized, reproducible evaluation pattern the challenge asks for.

## Files to Reference

| File | Relevance |
|------|-----------|
| `docs/blog-outline.md` | Section structure template |
| `docs/blog-post.md` | Current draft |
| `docs/design-docs/reward-shaping-research.md` | Reward theory + references |
| `docs/exploration/grpo-training-session-log.md` | All 9 runs detailed |
| `vision/VISION.md` | Product vision, success metrics |
| `training/trl_adapter.py` | Environment adapter code |
| `notebooks/compare_methods.ipynb` | Eval notebook |
| `notebooks/train_grpo.ipynb` | Training notebook |