sql_env / specs /F006-IMPLEMENTATION_SPEC.md
hjerpe's picture
Upload folder using huggingface_hub
5dd1bb4 verified
# Implementation Specification
**Change:** F006 -- GRPO Training Pipeline
**Date:** 2026-03-27
**Research Summary:** [specs/F006-RESEARCH_SUMMARY.md](F006-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Delta:** Archived to [specs/behavior/training.md](behavior/training.md)
**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed
---
## Core Intent (Immutable)
> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you are describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.
**User Problem:**
Train a model that learns SQL exploration strategy through RL. The "before vs after" comparison is the competition's money shot -- untrained agent flails randomly, trained agent explores strategically.
**Success Criteria:**
- Training notebook runs end-to-end in one click
- Learning curve clearly shows improvement over episodes
- Side-by-side episode transcripts: random vs trained
- Reproducible results (deterministic given seed)
**Avoid:**
- Training that does not converge at all (no learning signal)
- Requiring an expensive GPU for hours to see any signal
- Notebook with hidden dependencies that break on fresh setup
**Out of Scope:**
- wandb / TensorBoard integration (MVP: print metrics)
- vLLM inference (use HF generate for simplicity)
- Hard-difficulty questions in training set (add later)
- WebSocket-based training (use local env)
- Multi-GPU / distributed training
- Custom RLHF algorithms beyond GRPO
---
## 0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in **small, mergeable increments**.
### Scope Budget
- Target: **3 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**
### Slice Definition
| Slice | Name | Value |
|-------|------|-------|
| S1 | Training Config + Prompts | Configurable training setup, system prompt for SQL agent |
| S2 | Rollout + Rewards | TRL-compatible rollout function and reward callables |
| S3 | Training Notebook | End-to-end notebook with learning curve and comparison |
## Status Icons
**Step Status:**
- !! Not Started
- >> In Progress
- OK Completed
- XX Blocked/Failed
**Result Outcome:**
- OK Fully Successful (all tests passed, no issues)
- !! Completed with Issues (needs follow-up)
- XX Failed/Blocked
---
## 1. Implementation Overview
### Summary
Add a `training/` subpackage with configuration, rollout, reward wrappers, and prompt modules that integrate with TRL's GRPOTrainer. Provide a `notebooks/train_grpo.ipynb` notebook as the user-facing entry point that trains a small LLM (default: Qwen3-1.7B) to play SQLEnv, then produces learning curves and before/after episode comparisons.
### Scope
**In Scope:**
- `training/config.py` -- dataclass with all hyperparameters and model name
- `training/prompts.py` -- system prompt for SQL exploration agent
- `training/rollout.py` -- `rollout_func` that plays SQLEnv episodes via HF generate
- `training/rewards.py` -- reward callables matching TRL `reward_funcs` signature
- `notebooks/train_grpo.ipynb` -- end-to-end training notebook
- `training/__init__.py` -- public exports
**Out of Scope:**
- vLLM inference backend
- wandb/TensorBoard logging
- Training on hard-difficulty questions
- Distributed or multi-GPU training
---
## 1a. Execution Status
**Progress:** 6/6 steps complete
**Current Step:** None (implementation complete)
**Last Updated:** 2026-03-28T07:37:20Z
**Latest Result:** OK Fully Successful - Step 3.1 complete, 68/68 tests passed
**Blockers:** None
---
## 1b. Risk Assessment
**Risk Tier:** Medium
**Risk Tier Definitions:**
- **Low:** Pure logic, non-user-facing, no security implications
- **Medium:** User input handling, data validation, API changes
- **High:** Authentication, payments, secrets management, untrusted input
**High-Risk Indicators Present:** None
**Security Review Required:** No
**Justification:**
External model loading from HuggingFace Hub and GPU resource management require care, but no security-sensitive data flows. Risk is primarily around convergence and resource requirements.
---
## 2. Change Manifest
### Files to Create
| File | Purpose |
|------|---------|
| `training/__init__.py` | Package init, public exports |
| `training/config.py` | `GRPOConfig` dataclass with hyperparameters |
| `training/prompts.py` | System prompt for SQL exploration agent |
| `training/rollout.py` | `rollout_func` for TRL GRPOTrainer |
| `training/rewards.py` | Reward callables: correctness, progress, operational |
| `training/data_loading.py` | Model/question loading helpers for notebook runtime and tests |
| `training/notebook_pipeline.py` | Notebook orchestration helpers for trainer setup, baseline, and metrics |
| `notebooks/train_grpo.ipynb` | End-to-end training notebook |
| `tests/integration/test_training_pipeline.py` | Integration verification for rollout + rewards pipeline |
| `tests/e2e/test_training_e2e.py` | Notebook smoke verification and pipeline behavior checks |
| `tests/unit/test_error_handling.py` | Error-path verification for model/questions loading and fallback logging |
### Files to Modify
| File | Changes |
|------|---------|
| `pyproject.toml` | Add `trl` and training optional dependency group |
### Files to Delete
None.
---
## 3. Interface Specifications
### New Types
```python
# Location: training/config.py
from dataclasses import dataclass, field
@dataclass
class GRPOConfig:
"""All hyperparameters for GRPO training on SQLEnv."""
# Model
model_name: str = "Qwen/Qwen3-1.7B"
max_new_tokens: int = 256
# Training
num_train_epochs: int = 1
per_device_train_batch_size: int = 2
gradient_accumulation_steps: int = 4
learning_rate: float = 5e-6
num_generations: int = 4 # G in GRPO (completions per prompt)
# Environment
questions_path: str = "data/questions/questions_train.json"
db_dir: str = "data/databases"
step_budget: int = 10 # Shorter budget for training
difficulty_filter: list[str] = field(default_factory=lambda: ["easy", "medium"])
# Reproducibility
seed: int = 42
# Output
output_dir: str = "outputs/grpo_run"
logging_steps: int = 10
```
### New Functions
```python
# Location: training/prompts.py
def get_system_prompt() -> str:
"""Return the system prompt for the SQL exploration agent.
Returns:
System prompt string instructing the model on SQLEnv action format.
"""
def format_observation(obs: "SQLObservation") -> str:
"""Format an SQLObservation into a user-turn string for the model.
Args:
obs: The observation from the environment.
Returns:
Formatted string suitable as a user message in chat history.
"""
```
```python
# Location: training/rollout.py
from typing import Any
def rollout_func(
prompts: list[str],
model: Any,
tokenizer: Any,
config: "GRPOConfig",
) -> list[dict[str, Any]]:
"""Play SQLEnv episodes for a batch of question prompts.
Each prompt is a question text. The function:
1. Creates a local SQLEnvironment
2. Resets with the question
3. Loops: model.generate() -> parse action -> env.step()
4. Collects completions and metadata
Args:
prompts: List of question texts (from training dataset).
model: HuggingFace model for generation.
tokenizer: HuggingFace tokenizer.
config: Training configuration.
Returns:
List of dicts with keys:
- "prompt": str (the input prompt)
- "completion": str (full model output trajectory)
- "metadata": dict with episode_id, steps, done, answer_correct
"""
```
```python
# Location: training/rewards.py
def reward_correctness(
completions: list[list[dict[str, str]]],
**kwargs: Any,
) -> list[float]:
"""Binary reward: 1.0 if episode ended with correct answer, 0.0 otherwise.
Args:
completions: Batch of completion message lists (TRL format).
**kwargs: Additional metadata from rollout (includes 'metadata' key).
Returns:
List of float rewards, one per completion.
"""
def reward_progress(
completions: list[list[dict[str, str]]],
**kwargs: Any,
) -> list[float]:
"""Progress reward: cumulative progress score from environment.
Args:
completions: Batch of completion message lists (TRL format).
**kwargs: Additional metadata from rollout.
Returns:
List of float rewards, one per completion.
"""
def reward_operational(
completions: list[list[dict[str, str]]],
**kwargs: Any,
) -> list[float]:
"""Operational reward: sum of per-step L1 signals (exec_ok, new_info, etc.).
Args:
completions: Batch of completion message lists (TRL format).
**kwargs: Additional metadata from rollout.
Returns:
List of float rewards, one per completion.
"""
```
---
## 4. Data Flow
### Primary Flow (Training Loop)
```
1. Notebook loads GRPOConfig and model/tokenizer from HuggingFace
- Input: config.model_name
- Output: model, tokenizer, config
2. Load training questions filtered by difficulty
- Input: config.questions_path, config.difficulty_filter
- Output: list[str] of question texts as prompts
3. GRPOTrainer calls rollout_func for each batch of prompts
- Input: prompts, model, tokenizer, config
- Action: For each prompt, play a full SQLEnv episode
a. Create local SQLEnvironment
b. env.reset(question) -> initial observation
c. Loop: format obs -> model.generate() -> parse SQLAction -> env.step()
d. Collect full trajectory as completion string
- Output: completions + metadata (correctness, progress, operational signals)
4. GRPOTrainer calls each reward_func on completions
- Input: completions list, metadata kwargs
- Output: list[float] per reward function
5. GRPOTrainer computes GRPO loss and updates model weights
- Input: completions, rewards, model
- Output: updated model weights, logged metrics
6. Repeat steps 3-5 for num_train_epochs
```
### Alternative Flow: Unparseable Model Output
```
1. Model generates text that cannot be parsed as SQLAction
2. rollout_func defaults to QUERY action with raw text as argument
3. Environment returns an error observation
4. Episode continues (agent can recover in subsequent steps)
```
### Alternative Flow: Episode Exceeds Token Budget
```
1. Observation context grows beyond max_new_tokens window
2. rollout_func truncates conversation history, keeping:
a. System prompt (always)
b. Most recent 3 observation-action pairs
3. Episode continues with truncated context
```
---
## 5. Error Handling
### Error Types
| Error | When | Strategy |
|-------|------|----------|
| `ModelLoadError` | Model not found on HuggingFace | Fail fast with clear message naming model_name |
| `ActionParseError` | Model output not parseable as SQLAction | Default to QUERY with raw text, log warning |
| `OOMError` | GPU out of memory during training | Print guidance: reduce batch_size or num_generations |
| `QuestionLoadError` | Questions file missing or empty | Fail fast with path in error message |
| `EnvironmentError` | SQLEnv database missing | Fail fast pointing to data download instructions |
### Error Handling Strategy
```python
# In rollout_func: graceful degradation
try:
action = parse_action(model_output)
except ActionParseError:
action = SQLAction(action_type="QUERY", argument=model_output)
# In notebook: fail-fast on setup
try:
model = AutoModelForCausalLM.from_pretrained(config.model_name)
except Exception as e:
raise RuntimeError(f"Cannot load model '{config.model_name}': {e}")
```
### Retry Strategy
| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Model download | No | Fail fast, user must fix network/model name |
| Episode rollout | No | Single attempt per episode, errors become low-reward signal |
| Training step | No | OOM is fatal for that config, must adjust params |
---
## 6. Slice Plan (What we will ship, in order)
### Slice S1 -- Training Config + Prompts
**Value:** Centralized, documented configuration and system prompt ready for training integration
**User-visible change:** No (internal infrastructure)
**Interfaces introduced/changed:** `GRPOConfig`, `get_system_prompt()`, `format_observation()`
**Rollback safety:** Additive only -- new files, no existing code changed
### Slice S2 -- Rollout + Rewards
**Value:** TRL-compatible rollout and reward functions that can drive GRPO training
**User-visible change:** No (library code)
**Interfaces introduced/changed:** `rollout_func()`, `reward_correctness()`, `reward_progress()`, `reward_operational()`
**Rollback safety:** Additive only -- new files in training/ package
### Slice S3 -- Training Notebook
**Value:** Users can run one notebook to train a model and see before/after results
**User-visible change:** Yes -- the notebook is the primary deliverable
**Interfaces introduced/changed:** `notebooks/train_grpo.ipynb`, `pyproject.toml` training deps
**Rollback safety:** Notebook is standalone; pyproject.toml change is additive (optional deps group)
---
## 7. Implementation Steps
> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.
### Step 1.1: Training Config Dataclass
**Slice:** S1
**Goal:** Create `training/config.py` with `GRPOConfig` dataclass holding all hyperparameters.
**Files:**
- `training/__init__.py` - create - package init with public exports
- `training/config.py` - create - GRPOConfig dataclass
**Interface Changes:**
- New type: `GRPOConfig` with fields as specified in Section 3
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T06:44:31Z
**Changes Made:**
- Created `training/config.py` with `GRPOConfig` dataclass and input validation in `__post_init__`
- Created `training/__init__.py` exporting `GRPOConfig`
- Added `tests/unit/test_grpo_config.py` covering defaults, overrides, required fields, and validation failures
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_grpo_config.py -v
Result: 7 passed in 17.06s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_grpo_config.py -v`
- **Notes:**
- Added explicit validation for numeric bounds and non-empty difficulty filter to fail fast during setup
- `uv run pytest ...` failed because pytest is not installed by default; used `uv run --with pytest pytest ...` for scoped test dependency
- Kept config required fields (`questions_path`, `db_dir`, `output_dir`) positional/required per verification criteria
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- GRPOConfig available for import by prompts.py and rollout.py
---
### Step 1.2: System Prompt and Observation Formatter
**Slice:** S1
**Goal:** Create `training/prompts.py` with system prompt and observation formatting for model input.
**Files:**
- `training/prompts.py` - create - system prompt and observation formatter
**Interface Changes:**
- New functions: `get_system_prompt() -> str`, `format_observation(obs: SQLObservation) -> str`
**Details:**
- System prompt should instruct the model on:
- Available actions: DESCRIBE, SAMPLE, QUERY, ANSWER
- Action format: `ACTION_TYPE: argument`
- Exploration strategy guidance (describe tables first, then query, then answer)
- Budget awareness
- `format_observation` converts SQLObservation fields into a readable user-turn string
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T06:47:49Z
**Changes Made:**
- Created `training/prompts.py` with deterministic `get_system_prompt()` and `format_observation()` helpers
- Added truncation guard for long observation results to keep prompt payload bounded
- Updated `training/__init__.py` exports to include prompt helpers
- Added `tests/unit/test_prompts.py` covering prompt content and observation formatting edge cases
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_prompts.py -v
Result: 8 passed in 2.92s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_prompts.py -v`
- **Notes:**
- `uv run pytest ...` failed because pytest is not installed in the base env; used `uv run --with pytest pytest ...` for scoped dependency execution
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- Prompt module ready for use in rollout.py
---
### Step 2.1: Action Parser Utility
**Slice:** S2
**Goal:** Create a robust parser that extracts `SQLAction` from free-form model output text.
**Files:**
- `training/rollout.py` - create - contains `parse_model_output(text: str) -> SQLAction`
**Interface Changes:**
- New function: `parse_model_output(text: str) -> SQLAction`
- Parses `ACTION_TYPE: argument` format from model text
- Falls back to `SQLAction(action_type="QUERY", argument=text)` on parse failure
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T06:51:50Z
**Changes Made:**
- Created `training/rollout.py` with `parse_model_output(text)` and a focused line parser helper
- Added action parsing for DESCRIBE/SAMPLE/QUERY/ANSWER with case-insensitive matching
- Added robust fallback behavior to `SQLAction(action_type="QUERY", argument=<raw_text>)` on parse failure
- Added `tests/unit/test_rollout.py` with coverage for happy path, edge cases, multiline output, and fallback behavior
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
Result: 11 passed in 2.44s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rollout.py -v`
- **Notes:**
- `uv run pytest ...` failed because pytest is not installed in the base env; used `uv run --with pytest pytest ...` for scoped dependency execution
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- parse_model_output is available in `training/rollout.py` for Step 2.2 rollout integration
---
### Step 2.2: Rollout Function
**Slice:** S2
**Goal:** Implement `rollout_func` that plays full SQLEnv episodes using HF generate.
**Files:**
- `training/rollout.py` - modify - add `rollout_func` and `play_episode` helper
**Interface Changes:**
- New function: `rollout_func(prompts, model, tokenizer, config) -> list[dict]`
- New helper: `play_episode(question_text, model, tokenizer, config, env) -> dict`
- Creates local SQLEnvironment for the episode
- Loops: format obs -> generate -> parse -> step until done or budget exhausted
- Returns completion string and metadata dict
**Details:**
- Use `model.generate()` (HF native, not vLLM) for inference
- Build chat messages using tokenizer.apply_chat_template
- Truncate conversation history if it exceeds token window (keep system prompt + last 3 turns)
- Metadata includes: episode_id, step_count, done, answer_correct, cumulative_progress, operational_signals
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Medium
> Core integration point between model and environment -- most likely source of bugs.
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T07:04:59Z
**Changes Made:**
- Expanded `training/rollout.py` with `rollout_func`, `play_episode`, message-history truncation, prompt-aware environment reset, and HF `model.generate()` integration paths for both list and tensor-like outputs.
- Added rollout metadata fields (`episode_id`, `step_count`, `done`, `answer_correct`, `cumulative_progress`, `operational_signals`) and top-level compatibility keys (`content`, `correct`, `progress`, `operational`).
- Extended `tests/unit/test_rollout.py` with Step 2.2 coverage for batch behavior, step-budget termination, metadata shape, unparseable-action fallback continuity, history truncation, HF-style generation decoding, prompt binding, and incorrect-answer correctness guard.
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
Result: 21 passed in 2.58s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rollout.py -v`
- **Notes:**
- Used `uv run --with pytest ...` because `pytest` is not available in the base environment.
- Medium-risk reviewer gate executed and resolved to APPROVE after decoder/correctness fixes.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- rollout metadata now carries correctness/progress/operational signals needed by `training/rewards.py` in Step 2.3
---
### Step 2.3: Reward Functions
**Slice:** S2
**Goal:** Implement three TRL-compatible reward callables that consume rollout metadata.
**Files:**
- `training/rewards.py` - create - reward_correctness, reward_progress, reward_operational
**Interface Changes:**
- New functions (all with TRL reward_func signature):
- `reward_correctness(completions, **kwargs) -> list[float]`
- `reward_progress(completions, **kwargs) -> list[float]`
- `reward_operational(completions, **kwargs) -> list[float]`
**Details:**
- `reward_correctness`: Binary 1.0/0.0 based on metadata["answer_correct"]
- `reward_progress`: Float from metadata["cumulative_progress"], normalized to [0, 1]
- `reward_operational`: Sum of per-step operational signals from metadata["operational_signals"]
- All functions access metadata via kwargs (TRL passes extra data from rollout return)
- Each function must handle missing metadata gracefully (return 0.0)
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T07:07:32Z
**Changes Made:**
- Created `training/rewards.py` with TRL-compatible `reward_correctness`, `reward_progress`, and `reward_operational` callables
- Added robust metadata extraction paths so reward functions support both nested `metadata` payloads and flattened rollout kwargs
- Updated `training/__init__.py` exports for reward helper imports from the package root
- Added `tests/unit/test_rewards.py` covering correctness/progress/operational behavior across happy path, edge, and batch scenarios
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_rewards.py -v
Result: 19 passed in 3.35s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_rewards.py -v`
- **Notes:**
- Used `uv run --with pytest ...` because `pytest` is not available in the base environment.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- `training/` now exposes config, prompts, rollout parsing/execution, and reward callables; next step is notebook wiring plus optional training dependencies in `pyproject.toml`
---
### Step 3.1: Training Notebook
**Slice:** S3
**Goal:** Create end-to-end training notebook that loads model, trains with GRPO, and produces learning curves.
**Files:**
- `notebooks/train_grpo.ipynb` - create - end-to-end training notebook
- `pyproject.toml` - modify - add `[project.optional-dependencies] training` group
**Interface Changes:**
- New optional dependency group: `training = ["trl>=0.12.0", "accelerate>=0.34.0"]`
**Details:**
Notebook cells (linear flow):
1. **Setup**: Install dependencies, import modules, set seed
2. **Config**: Instantiate GRPOConfig (users can override model_name here)
3. **Load Model**: `AutoModelForCausalLM.from_pretrained(config.model_name)`
4. **Load Dataset**: Load questions, filter by difficulty, format as prompts
5. **Initialize GRPOTrainer**: Pass model, tokenizer, rollout_func, reward_funcs, config
6. **Train**: `trainer.train()` with progress bar and metric printing
7. **Learning Curve**: Plot reward over training steps (matplotlib)
8. **Comparison**: Run 5 episodes with random actions vs trained model, display side-by-side transcripts
9. **Save**: Save trained model to config.output_dir
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Medium
> User-facing deliverable; must work on fresh setup.
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** OK Completed
**Completed:** 2026-03-28T07:37:20Z
**Changes Made:**
- Created `notebooks/train_grpo.ipynb` as the primary user-facing training notebook for F006, with one-pass setup, model/question loading, trainer construction, training execution, learning-curve plotting, random-baseline vs trained transcript comparison, and artifact save steps.
- Added `[project.optional-dependencies].training` in `pyproject.toml` with `trl>=0.14.0,<0.15.0` and `accelerate>=0.34.0` to keep TRL/torch compatibility stable for this repository.
- Added `training/data_loading.py` to centralize notebook error handling for model loading and question filtering/loading.
- Added `training/notebook_pipeline.py` to centralize trainer wiring, random baseline generation, training execution, and metrics extraction.
- Updated `training/__init__.py` exports to include notebook-facing helpers.
- Added `tests/e2e/test_training_e2e.py` for notebook smoke structure + pipeline behavior checks.
- Added `tests/integration/test_training_pipeline.py` for rollout/reward integration scenarios.
- Added `tests/unit/test_error_handling.py` for model/question loading failures, OOM guidance messaging, and parse-fallback warning logging.
**Result:**
- **Outcome:** OK Fully Successful
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v
Result: 68 passed in 5.79s
Command: uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"
Result: ok
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v`
- **Notes:**
- Added concrete integration/e2e/error test files that were listed in `VERIFICATION_SPEC.md` but missing from repository.
- Notebook now compares random-policy baseline transcripts against trained-policy transcripts, matching the feature's user-facing comparison goal.
- Parse fallback now emits a warning log to align behavior with error-handling verification expectations.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- All implementation deliverables complete; feature is ready for final verification/finalization bookkeeping.
---
## 8. Rollout Considerations
### Feature Flags
- [ ] Required: No
### Migration
- [ ] Data migration needed: No
### Rollback Plan
All changes are additive (new `training/` package and `notebooks/` directory). Rollback is simply removing those directories and reverting the pyproject.toml optional deps change.
---
## 9. Execution Tracking
All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history
---
## 9a. Slice Completion Protocol
After all steps in a slice pass verification:
1. **Run verifier subagent** for spec compliance
- Validates against VERIFICATION_SPEC.md criteria
- Ensures no TODOs or incomplete work in slice
2. **Run compound-engineer subagent** to extract learnings
- **Mandatory invocation** after every slice completion
- Updates CLAUDE.md Learnings section (if durable patterns found)
- May exit with "no update needed" (valid for routine work)
3. **Commit** the slice changes
- Follow commit message format in CLAUDE.md
- Each slice gets its own atomic commit
4. **Continue to next slice** (if more slices remain)
- Or proceed to final verification if all slices complete
**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.
---
## 10. User Value Summary
**Status:** Generated
### What Users Can Now Do
Users can now run a single notebook (`notebooks/train_grpo.ipynb`) to configure GRPO training, load a compatible TRL stack, train a model on SQLEnv prompts, and inspect both reward-curve output and transcript comparisons between random and trained policies.
### How to Access/Test
1. Install training extras: `uv sync --extra training`
2. Open `notebooks/train_grpo.ipynb`
3. Run all cells to train and save artifacts to `outputs/grpo_run`
### Demo
- **Command:** `jupyter notebook notebooks/train_grpo.ipynb`
- **Verification command:** `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v`
### Release Notes Snippet
Add a GRPO training pipeline for SQLEnv with a runnable notebook, pinned TRL training dependencies, robust loading/error helpers, and verification coverage across unit, integration, and notebook-smoke paths.
---
## 11. PR Contract (Auto-Generated by autocode-next-step)
**Status:** Generated
### Scope
- Finalized Step 3.1 (Training Notebook) for F006.
- Added training optional dependency group in `pyproject.toml` with TRL pin compatible with repo torch version.
- Added notebook support helpers for model/question loading and trainer orchestration.
- Added/expanded verification tests for notebook smoke, pipeline integration, and error handling.
### Files Changed
- `pyproject.toml`
- `notebooks/train_grpo.ipynb`
- `training/__init__.py`
- `training/data_loading.py`
- `training/notebook_pipeline.py`
- `training/rollout.py`
- `tests/e2e/test_training_e2e.py`
- `tests/integration/test_training_pipeline.py`
- `tests/unit/test_error_handling.py`
- `specs/F006-IMPLEMENTATION_SPEC.md`
- `specs/behavior/training.md`
### Verification Evidence
- `uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v` -> 68 passed
- `uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"` -> ok
- Verifier verdict: APPROVED (`specs/F006-VERIFICATION_REPORT.md`)
### Risk and Rollback
- Risk tier: Medium (training dependencies and user-facing notebook workflow).
- Rollback: remove notebook/training helper additions and revert `pyproject.toml` training extra.
### Ready for Next Command
All implementation and verification criteria for F006 are complete. Run `/commit-push-pr` when ready.
---
## Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
---
## Human Checkpoint
**Before handing to AI agent:**
- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated
**Questions:**
1. Confirm Qwen3-1.7B is accessible on HuggingFace Hub for the target environment.
2. Verify TRL GRPOTrainer API matches the rollout_func / reward_funcs signatures assumed here.
---
## Handoff Notes
**For the implementing AI agent:**
```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
- HF generate (not vLLM) for inference
- Model name is a config parameter (default Qwen3-1.7B)
- Start with easy+medium questions only
- Follow TRL GRPOTrainer Wordle tutorial pattern
- reward_funcs are separate callables
```
---
*Specification completed: 2026-03-27*
*Approved by: [pending]*
*Verification spec: VERIFICATION_SPEC.md*
*Target agent: Claude Code*