Implementation Specification
Change: F006 -- GRPO Training Pipeline Date: 2026-03-27 Research Summary: specs/F006-RESEARCH_SUMMARY.md Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner) Behavior Delta: Archived to specs/behavior/training.md
Plan Status:
- Draft
- Approved for Implementation
- Implementation Complete
- Verification Passed
Core Intent (Immutable)
DO NOT MODIFY THIS SECTION DURING REFINEMENT Changes to Core Intent mean you are describing a different feature. If refinement reveals the need to change this section, create a new feature instead.
User Problem: Train a model that learns SQL exploration strategy through RL. The "before vs after" comparison is the competition's money shot -- untrained agent flails randomly, trained agent explores strategically.
Success Criteria:
- Training notebook runs end-to-end in one click
- Learning curve clearly shows improvement over episodes
- Side-by-side episode transcripts: random vs trained
- Reproducible results (deterministic given seed)
Avoid:
- Training that does not converge at all (no learning signal)
- Requiring an expensive GPU for hours to see any signal
- Notebook with hidden dependencies that break on fresh setup
Out of Scope:
- wandb / TensorBoard integration (MVP: print metrics)
- vLLM inference (use HF generate for simplicity)
- Hard-difficulty questions in training set (add later)
- WebSocket-based training (use local env)
- Multi-GPU / distributed training
- Custom RLHF algorithms beyond GRPO
0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in small, mergeable increments.
Scope Budget
- Target: 3 slices
- Hard max: <= 10 steps total
- Each step must end in: implement -> verify -> merge
Slice Definition
| Slice | Name | Value |
|---|---|---|
| S1 | Training Config + Prompts | Configurable training setup, system prompt for SQL agent |
| S2 | Rollout + Rewards | TRL-compatible rollout function and reward callables |
| S3 | Training Notebook | End-to-end notebook with learning curve and comparison |
Status Icons
Step Status:
- !! Not Started
In Progress
- OK Completed
- XX Blocked/Failed
Result Outcome:
- OK Fully Successful (all tests passed, no issues)
- !! Completed with Issues (needs follow-up)
- XX Failed/Blocked
1. Implementation Overview
Summary
Add a training/ subpackage with configuration, rollout, reward wrappers, and prompt modules that integrate with TRL's GRPOTrainer. Provide a notebooks/train_grpo.ipynb notebook as the user-facing entry point that trains a small LLM (default: Qwen3-1.7B) to play SQLEnv, then produces learning curves and before/after episode comparisons.
Scope
In Scope:
training/config.py-- dataclass with all hyperparameters and model nametraining/prompts.py-- system prompt for SQL exploration agenttraining/rollout.py--rollout_functhat plays SQLEnv episodes via HF generatetraining/rewards.py-- reward callables matching TRLreward_funcssignaturenotebooks/train_grpo.ipynb-- end-to-end training notebooktraining/__init__.py-- public exports
Out of Scope:
- vLLM inference backend
- wandb/TensorBoard logging
- Training on hard-difficulty questions
- Distributed or multi-GPU training
1a. Execution Status
Progress: 6/6 steps complete Current Step: None (implementation complete) Last Updated: 2026-03-28T07:37:20Z Latest Result: OK Fully Successful - Step 3.1 complete, 68/68 tests passed Blockers: None
1b. Risk Assessment
Risk Tier: Medium
Risk Tier Definitions:
- Low: Pure logic, non-user-facing, no security implications
- Medium: User input handling, data validation, API changes
- High: Authentication, payments, secrets management, untrusted input
High-Risk Indicators Present: None
Security Review Required: No
Justification: External model loading from HuggingFace Hub and GPU resource management require care, but no security-sensitive data flows. Risk is primarily around convergence and resource requirements.
2. Change Manifest
Files to Create
| File | Purpose |
|---|---|
training/__init__.py |
Package init, public exports |
training/config.py |
GRPOConfig dataclass with hyperparameters |
training/prompts.py |
System prompt for SQL exploration agent |
training/rollout.py |
rollout_func for TRL GRPOTrainer |
training/rewards.py |
Reward callables: correctness, progress, operational |
training/data_loading.py |
Model/question loading helpers for notebook runtime and tests |
training/notebook_pipeline.py |
Notebook orchestration helpers for trainer setup, baseline, and metrics |
notebooks/train_grpo.ipynb |
End-to-end training notebook |
tests/integration/test_training_pipeline.py |
Integration verification for rollout + rewards pipeline |
tests/e2e/test_training_e2e.py |
Notebook smoke verification and pipeline behavior checks |
tests/unit/test_error_handling.py |
Error-path verification for model/questions loading and fallback logging |
Files to Modify
| File | Changes |
|---|---|
pyproject.toml |
Add trl and training optional dependency group |
Files to Delete
None.
3. Interface Specifications
New Types
# Location: training/config.py
from dataclasses import dataclass, field
@dataclass
class GRPOConfig:
"""All hyperparameters for GRPO training on SQLEnv."""
# Model
model_name: str = "Qwen/Qwen3-1.7B"
max_new_tokens: int = 256
# Training
num_train_epochs: int = 1
per_device_train_batch_size: int = 2
gradient_accumulation_steps: int = 4
learning_rate: float = 5e-6
num_generations: int = 4 # G in GRPO (completions per prompt)
# Environment
questions_path: str = "data/questions/questions_train.json"
db_dir: str = "data/databases"
step_budget: int = 10 # Shorter budget for training
difficulty_filter: list[str] = field(default_factory=lambda: ["easy", "medium"])
# Reproducibility
seed: int = 42
# Output
output_dir: str = "outputs/grpo_run"
logging_steps: int = 10
New Functions
# Location: training/prompts.py
def get_system_prompt() -> str:
"""Return the system prompt for the SQL exploration agent.
Returns:
System prompt string instructing the model on SQLEnv action format.
"""
def format_observation(obs: "SQLObservation") -> str:
"""Format an SQLObservation into a user-turn string for the model.
Args:
obs: The observation from the environment.
Returns:
Formatted string suitable as a user message in chat history.
"""
# Location: training/rollout.py
from typing import Any
def rollout_func(
prompts: list[str],
model: Any,
tokenizer: Any,
config: "GRPOConfig",
) -> list[dict[str, Any]]:
"""Play SQLEnv episodes for a batch of question prompts.
Each prompt is a question text. The function:
1. Creates a local SQLEnvironment
2. Resets with the question
3. Loops: model.generate() -> parse action -> env.step()
4. Collects completions and metadata
Args:
prompts: List of question texts (from training dataset).
model: HuggingFace model for generation.
tokenizer: HuggingFace tokenizer.
config: Training configuration.
Returns:
List of dicts with keys:
- "prompt": str (the input prompt)
- "completion": str (full model output trajectory)
- "metadata": dict with episode_id, steps, done, answer_correct
"""
# Location: training/rewards.py
def reward_correctness(
completions: list[list[dict[str, str]]],
**kwargs: Any,
) -> list[float]:
"""Binary reward: 1.0 if episode ended with correct answer, 0.0 otherwise.
Args:
completions: Batch of completion message lists (TRL format).
**kwargs: Additional metadata from rollout (includes 'metadata' key).
Returns:
List of float rewards, one per completion.
"""
def reward_progress(
completions: list[list[dict[str, str]]],
**kwargs: Any,
) -> list[float]:
"""Progress reward: cumulative progress score from environment.
Args:
completions: Batch of completion message lists (TRL format).
**kwargs: Additional metadata from rollout.
Returns:
List of float rewards, one per completion.
"""
def reward_operational(
completions: list[list[dict[str, str]]],
**kwargs: Any,
) -> list[float]:
"""Operational reward: sum of per-step L1 signals (exec_ok, new_info, etc.).
Args:
completions: Batch of completion message lists (TRL format).
**kwargs: Additional metadata from rollout.
Returns:
List of float rewards, one per completion.
"""
4. Data Flow
Primary Flow (Training Loop)
1. Notebook loads GRPOConfig and model/tokenizer from HuggingFace
- Input: config.model_name
- Output: model, tokenizer, config
2. Load training questions filtered by difficulty
- Input: config.questions_path, config.difficulty_filter
- Output: list[str] of question texts as prompts
3. GRPOTrainer calls rollout_func for each batch of prompts
- Input: prompts, model, tokenizer, config
- Action: For each prompt, play a full SQLEnv episode
a. Create local SQLEnvironment
b. env.reset(question) -> initial observation
c. Loop: format obs -> model.generate() -> parse SQLAction -> env.step()
d. Collect full trajectory as completion string
- Output: completions + metadata (correctness, progress, operational signals)
4. GRPOTrainer calls each reward_func on completions
- Input: completions list, metadata kwargs
- Output: list[float] per reward function
5. GRPOTrainer computes GRPO loss and updates model weights
- Input: completions, rewards, model
- Output: updated model weights, logged metrics
6. Repeat steps 3-5 for num_train_epochs
Alternative Flow: Unparseable Model Output
1. Model generates text that cannot be parsed as SQLAction
2. rollout_func defaults to QUERY action with raw text as argument
3. Environment returns an error observation
4. Episode continues (agent can recover in subsequent steps)
Alternative Flow: Episode Exceeds Token Budget
1. Observation context grows beyond max_new_tokens window
2. rollout_func truncates conversation history, keeping:
a. System prompt (always)
b. Most recent 3 observation-action pairs
3. Episode continues with truncated context
5. Error Handling
Error Types
| Error | When | Strategy |
|---|---|---|
ModelLoadError |
Model not found on HuggingFace | Fail fast with clear message naming model_name |
ActionParseError |
Model output not parseable as SQLAction | Default to QUERY with raw text, log warning |
OOMError |
GPU out of memory during training | Print guidance: reduce batch_size or num_generations |
QuestionLoadError |
Questions file missing or empty | Fail fast with path in error message |
EnvironmentError |
SQLEnv database missing | Fail fast pointing to data download instructions |
Error Handling Strategy
# In rollout_func: graceful degradation
try:
action = parse_action(model_output)
except ActionParseError:
action = SQLAction(action_type="QUERY", argument=model_output)
# In notebook: fail-fast on setup
try:
model = AutoModelForCausalLM.from_pretrained(config.model_name)
except Exception as e:
raise RuntimeError(f"Cannot load model '{config.model_name}': {e}")
Retry Strategy
| Operation | Retry? | Strategy |
|---|---|---|
| Model download | No | Fail fast, user must fix network/model name |
| Episode rollout | No | Single attempt per episode, errors become low-reward signal |
| Training step | No | OOM is fatal for that config, must adjust params |
6. Slice Plan (What we will ship, in order)
Slice S1 -- Training Config + Prompts
Value: Centralized, documented configuration and system prompt ready for training integration
User-visible change: No (internal infrastructure)
Interfaces introduced/changed: GRPOConfig, get_system_prompt(), format_observation()
Rollback safety: Additive only -- new files, no existing code changed
Slice S2 -- Rollout + Rewards
Value: TRL-compatible rollout and reward functions that can drive GRPO training
User-visible change: No (library code)
Interfaces introduced/changed: rollout_func(), reward_correctness(), reward_progress(), reward_operational()
Rollback safety: Additive only -- new files in training/ package
Slice S3 -- Training Notebook
Value: Users can run one notebook to train a model and see before/after results
User-visible change: Yes -- the notebook is the primary deliverable
Interfaces introduced/changed: notebooks/train_grpo.ipynb, pyproject.toml training deps
Rollback safety: Notebook is standalone; pyproject.toml change is additive (optional deps group)
7. Implementation Steps
VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.
Step 1.1: Training Config Dataclass
Slice: S1
Goal: Create training/config.py with GRPOConfig dataclass holding all hyperparameters.
Files:
training/__init__.py- create - package init with public exportstraining/config.py- create - GRPOConfig dataclass
Interface Changes:
- New type:
GRPOConfigwith fields as specified in Section 3
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-28T06:44:31Z Changes Made:
- Created
training/config.pywithGRPOConfigdataclass and input validation in__post_init__ - Created
training/__init__.pyexportingGRPOConfig - Added
tests/unit/test_grpo_config.pycovering defaults, overrides, required fields, and validation failures
Result:
- Outcome: OK Fully Successful
- Evidence Captured:
Command: uv run --with pytest pytest tests/unit/test_grpo_config.py -v Result: 7 passed in 17.06s - Tests run:
uv run --with pytest pytest tests/unit/test_grpo_config.py -v - Notes:
- Added explicit validation for numeric bounds and non-empty difficulty filter to fail fast during setup
uv run pytest ...failed because pytest is not installed by default; useduv run --with pytest pytest ...for scoped test dependency- Kept config required fields (
questions_path,db_dir,output_dir) positional/required per verification criteria
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- GRPOConfig available for import by prompts.py and rollout.py
Step 1.2: System Prompt and Observation Formatter
Slice: S1
Goal: Create training/prompts.py with system prompt and observation formatting for model input.
Files:
training/prompts.py- create - system prompt and observation formatter
Interface Changes:
- New functions:
get_system_prompt() -> str,format_observation(obs: SQLObservation) -> str
Details:
- System prompt should instruct the model on:
- Available actions: DESCRIBE, SAMPLE, QUERY, ANSWER
- Action format:
ACTION_TYPE: argument - Exploration strategy guidance (describe tables first, then query, then answer)
- Budget awareness
format_observationconverts SQLObservation fields into a readable user-turn string
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-28T06:47:49Z Changes Made:
- Created
training/prompts.pywith deterministicget_system_prompt()andformat_observation()helpers - Added truncation guard for long observation results to keep prompt payload bounded
- Updated
training/__init__.pyexports to include prompt helpers - Added
tests/unit/test_prompts.pycovering prompt content and observation formatting edge cases
Result:
- Outcome: OK Fully Successful
- Evidence Captured:
Command: uv run --with pytest pytest tests/unit/test_prompts.py -v Result: 8 passed in 2.92s - Tests run:
uv run --with pytest pytest tests/unit/test_prompts.py -v - Notes:
uv run pytest ...failed because pytest is not installed in the base env; useduv run --with pytest pytest ...for scoped dependency execution
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- Prompt module ready for use in rollout.py
Step 2.1: Action Parser Utility
Slice: S2
Goal: Create a robust parser that extracts SQLAction from free-form model output text.
Files:
training/rollout.py- create - containsparse_model_output(text: str) -> SQLAction
Interface Changes:
- New function:
parse_model_output(text: str) -> SQLAction- Parses
ACTION_TYPE: argumentformat from model text - Falls back to
SQLAction(action_type="QUERY", argument=text)on parse failure
- Parses
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-28T06:51:50Z Changes Made:
- Created
training/rollout.pywithparse_model_output(text)and a focused line parser helper - Added action parsing for DESCRIBE/SAMPLE/QUERY/ANSWER with case-insensitive matching
- Added robust fallback behavior to
SQLAction(action_type="QUERY", argument=<raw_text>)on parse failure - Added
tests/unit/test_rollout.pywith coverage for happy path, edge cases, multiline output, and fallback behavior
Result:
- Outcome: OK Fully Successful
- Evidence Captured:
Command: uv run --with pytest pytest tests/unit/test_rollout.py -v Result: 11 passed in 2.44s - Tests run:
uv run --with pytest pytest tests/unit/test_rollout.py -v - Notes:
uv run pytest ...failed because pytest is not installed in the base env; useduv run --with pytest pytest ...for scoped dependency execution
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- parse_model_output is available in
training/rollout.pyfor Step 2.2 rollout integration
Step 2.2: Rollout Function
Slice: S2
Goal: Implement rollout_func that plays full SQLEnv episodes using HF generate.
Files:
training/rollout.py- modify - addrollout_funcandplay_episodehelper
Interface Changes:
- New function:
rollout_func(prompts, model, tokenizer, config) -> list[dict] - New helper:
play_episode(question_text, model, tokenizer, config, env) -> dict- Creates local SQLEnvironment for the episode
- Loops: format obs -> generate -> parse -> step until done or budget exhausted
- Returns completion string and metadata dict
Details:
- Use
model.generate()(HF native, not vLLM) for inference - Build chat messages using tokenizer.apply_chat_template
- Truncate conversation history if it exceeds token window (keep system prompt + last 3 turns)
- Metadata includes: episode_id, step_count, done, answer_correct, cumulative_progress, operational_signals
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Medium
Core integration point between model and environment -- most likely source of bugs.
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-28T07:04:59Z Changes Made:
- Expanded
training/rollout.pywithrollout_func,play_episode, message-history truncation, prompt-aware environment reset, and HFmodel.generate()integration paths for both list and tensor-like outputs. - Added rollout metadata fields (
episode_id,step_count,done,answer_correct,cumulative_progress,operational_signals) and top-level compatibility keys (content,correct,progress,operational). - Extended
tests/unit/test_rollout.pywith Step 2.2 coverage for batch behavior, step-budget termination, metadata shape, unparseable-action fallback continuity, history truncation, HF-style generation decoding, prompt binding, and incorrect-answer correctness guard.
Result:
- Outcome: OK Fully Successful
- Evidence Captured:
Command: uv run --with pytest pytest tests/unit/test_rollout.py -v Result: 21 passed in 2.58s - Tests run:
uv run --with pytest pytest tests/unit/test_rollout.py -v - Notes:
- Used
uv run --with pytest ...becausepytestis not available in the base environment. - Medium-risk reviewer gate executed and resolved to APPROVE after decoder/correctness fixes.
- Used
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- rollout metadata now carries correctness/progress/operational signals needed by
training/rewards.pyin Step 2.3
Step 2.3: Reward Functions
Slice: S2 Goal: Implement three TRL-compatible reward callables that consume rollout metadata.
Files:
training/rewards.py- create - reward_correctness, reward_progress, reward_operational
Interface Changes:
- New functions (all with TRL reward_func signature):
reward_correctness(completions, **kwargs) -> list[float]reward_progress(completions, **kwargs) -> list[float]reward_operational(completions, **kwargs) -> list[float]
Details:
reward_correctness: Binary 1.0/0.0 based on metadata["answer_correct"]reward_progress: Float from metadata["cumulative_progress"], normalized to [0, 1]reward_operational: Sum of per-step operational signals from metadata["operational_signals"]- All functions access metadata via kwargs (TRL passes extra data from rollout return)
- Each function must handle missing metadata gracefully (return 0.0)
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Low
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-28T07:07:32Z Changes Made:
- Created
training/rewards.pywith TRL-compatiblereward_correctness,reward_progress, andreward_operationalcallables - Added robust metadata extraction paths so reward functions support both nested
metadatapayloads and flattened rollout kwargs - Updated
training/__init__.pyexports for reward helper imports from the package root - Added
tests/unit/test_rewards.pycovering correctness/progress/operational behavior across happy path, edge, and batch scenarios
Result:
- Outcome: OK Fully Successful
- Evidence Captured:
Command: uv run --with pytest pytest tests/unit/test_rewards.py -v Result: 19 passed in 3.35s - Tests run:
uv run --with pytest pytest tests/unit/test_rewards.py -v - Notes:
- Used
uv run --with pytest ...becausepytestis not available in the base environment.
- Used
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
training/now exposes config, prompts, rollout parsing/execution, and reward callables; next step is notebook wiring plus optional training dependencies inpyproject.toml
Step 3.1: Training Notebook
Slice: S3 Goal: Create end-to-end training notebook that loads model, trains with GRPO, and produces learning curves.
Files:
notebooks/train_grpo.ipynb- create - end-to-end training notebookpyproject.toml- modify - add[project.optional-dependencies] traininggroup
Interface Changes:
- New optional dependency group:
training = ["trl>=0.12.0", "accelerate>=0.34.0"]
Details: Notebook cells (linear flow):
- Setup: Install dependencies, import modules, set seed
- Config: Instantiate GRPOConfig (users can override model_name here)
- Load Model:
AutoModelForCausalLM.from_pretrained(config.model_name) - Load Dataset: Load questions, filter by difficulty, format as prompts
- Initialize GRPOTrainer: Pass model, tokenizer, rollout_func, reward_funcs, config
- Train:
trainer.train()with progress bar and metric printing - Learning Curve: Plot reward over training steps (matplotlib)
- Comparison: Run 5 episodes with random actions vs trained model, display side-by-side transcripts
- Save: Save trained model to config.output_dir
Verification:
See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
Risk Tier for This Step: Medium
User-facing deliverable; must work on fresh setup.
Merge Criteria:
- Tests from VERIFICATION_SPEC.md pass
- No TODOs left in changed code (or explicitly tracked)
- Backwards compatible (or flag/migration documented)
Status: OK Completed
Completed: 2026-03-28T07:37:20Z Changes Made:
- Created
notebooks/train_grpo.ipynbas the primary user-facing training notebook for F006, with one-pass setup, model/question loading, trainer construction, training execution, learning-curve plotting, random-baseline vs trained transcript comparison, and artifact save steps. - Added
[project.optional-dependencies].traininginpyproject.tomlwithtrl>=0.14.0,<0.15.0andaccelerate>=0.34.0to keep TRL/torch compatibility stable for this repository. - Added
training/data_loading.pyto centralize notebook error handling for model loading and question filtering/loading. - Added
training/notebook_pipeline.pyto centralize trainer wiring, random baseline generation, training execution, and metrics extraction. - Updated
training/__init__.pyexports to include notebook-facing helpers. - Added
tests/e2e/test_training_e2e.pyfor notebook smoke structure + pipeline behavior checks. - Added
tests/integration/test_training_pipeline.pyfor rollout/reward integration scenarios. - Added
tests/unit/test_error_handling.pyfor model/question loading failures, OOM guidance messaging, and parse-fallback warning logging.
Result:
- Outcome: OK Fully Successful
- Evidence Captured:
Command: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v Result: 68 passed in 5.79s Command: uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')" Result: ok - Tests run:
uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v - Notes:
- Added concrete integration/e2e/error test files that were listed in
VERIFICATION_SPEC.mdbut missing from repository. - Notebook now compares random-policy baseline transcripts against trained-policy transcripts, matching the feature's user-facing comparison goal.
- Parse fallback now emits a warning log to align behavior with error-handling verification expectations.
- Added concrete integration/e2e/error test files that were listed in
- Issues: None
- Follow-ups Created: None
- Human Review Completed: N/A
Context for Next Step:
- All implementation deliverables complete; feature is ready for final verification/finalization bookkeeping.
8. Rollout Considerations
Feature Flags
- Required: No
Migration
- Data migration needed: No
Rollback Plan
All changes are additive (new training/ package and notebooks/ directory). Rollback is simply removing those directories and reverting the pyproject.toml optional deps change.
9. Execution Tracking
All execution state is tracked within this document:
- Section 1a: Overall progress summary
- Section 7: Per-step completion details, test results, and handoff context
- FEATURES.json: Feature-level status/progress metadata used by
/autocode-next-stepandopencode-ctx ralph run - Git history: Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's
progressandstatusfields inFEATURES.json - Running
git log --oneline IMPLEMENTATION_SPEC.mdfor change history
9a. Slice Completion Protocol
After all steps in a slice pass verification:
Run verifier subagent for spec compliance
- Validates against VERIFICATION_SPEC.md criteria
- Ensures no TODOs or incomplete work in slice
Run compound-engineer subagent to extract learnings
- Mandatory invocation after every slice completion
- Updates CLAUDE.md Learnings section (if durable patterns found)
- May exit with "no update needed" (valid for routine work)
Commit the slice changes
- Follow commit message format in CLAUDE.md
- Each slice gets its own atomic commit
Continue to next slice (if more slices remain)
- Or proceed to final verification if all slices complete
Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.
10. User Value Summary
Status: Generated
What Users Can Now Do
Users can now run a single notebook (notebooks/train_grpo.ipynb) to configure GRPO training, load a compatible TRL stack, train a model on SQLEnv prompts, and inspect both reward-curve output and transcript comparisons between random and trained policies.
How to Access/Test
- Install training extras:
uv sync --extra training - Open
notebooks/train_grpo.ipynb - Run all cells to train and save artifacts to
outputs/grpo_run
Demo
- Command:
jupyter notebook notebooks/train_grpo.ipynb - Verification command:
uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v
Release Notes Snippet
Add a GRPO training pipeline for SQLEnv with a runnable notebook, pinned TRL training dependencies, robust loading/error helpers, and verification coverage across unit, integration, and notebook-smoke paths.
11. PR Contract (Auto-Generated by autocode-next-step)
Status: Generated
Scope
- Finalized Step 3.1 (Training Notebook) for F006.
- Added training optional dependency group in
pyproject.tomlwith TRL pin compatible with repo torch version. - Added notebook support helpers for model/question loading and trainer orchestration.
- Added/expanded verification tests for notebook smoke, pipeline integration, and error handling.
Files Changed
pyproject.tomlnotebooks/train_grpo.ipynbtraining/__init__.pytraining/data_loading.pytraining/notebook_pipeline.pytraining/rollout.pytests/e2e/test_training_e2e.pytests/integration/test_training_pipeline.pytests/unit/test_error_handling.pyspecs/F006-IMPLEMENTATION_SPEC.mdspecs/behavior/training.md
Verification Evidence
uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v-> 68 passeduv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"-> ok- Verifier verdict: APPROVED (
specs/F006-VERIFICATION_REPORT.md)
Risk and Rollback
- Risk tier: Medium (training dependencies and user-facing notebook workflow).
- Rollback: remove notebook/training helper additions and revert
pyproject.tomltraining extra.
Ready for Next Command
All implementation and verification criteria for F006 are complete. Run /commit-push-pr when ready.
Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than 3 files in unrelated areas
- You need to introduce multiple new abstractions "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
Human Checkpoint
Before handing to AI agent:
- Interface specifications are complete
- Data flow is accurate
- Error handling is specified
- Implementation order makes sense
- VERIFICATION_SPEC.md has been generated
Questions:
- Confirm Qwen3-1.7B is accessible on HuggingFace Hub for the target environment.
- Verify TRL GRPOTrainer API matches the rollout_func / reward_funcs signatures assumed here.
Handoff Notes
For the implementing AI agent:
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
- HF generate (not vLLM) for inference
- Model name is a config parameter (default Qwen3-1.7B)
- Start with easy+medium questions only
- Follow TRL GRPOTrainer Wordle tutorial pattern
- reward_funcs are separate callables
Specification completed: 2026-03-27 Approved by: [pending] Verification spec: VERIFICATION_SPEC.md Target agent: Claude Code