sql_env / specs /F006-IMPLEMENTATION_SPEC.md
hjerpe's picture
Upload folder using huggingface_hub
5dd1bb4 verified

Implementation Specification

Change: F006 -- GRPO Training Pipeline Date: 2026-03-27 Research Summary: specs/F006-RESEARCH_SUMMARY.md Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner) Behavior Delta: Archived to specs/behavior/training.md

Plan Status:

  • Draft
  • Approved for Implementation
  • Implementation Complete
  • Verification Passed

Core Intent (Immutable)

DO NOT MODIFY THIS SECTION DURING REFINEMENT Changes to Core Intent mean you are describing a different feature. If refinement reveals the need to change this section, create a new feature instead.

User Problem: Train a model that learns SQL exploration strategy through RL. The "before vs after" comparison is the competition's money shot -- untrained agent flails randomly, trained agent explores strategically.

Success Criteria:

  • Training notebook runs end-to-end in one click
  • Learning curve clearly shows improvement over episodes
  • Side-by-side episode transcripts: random vs trained
  • Reproducible results (deterministic given seed)

Avoid:

  • Training that does not converge at all (no learning signal)
  • Requiring an expensive GPU for hours to see any signal
  • Notebook with hidden dependencies that break on fresh setup

Out of Scope:

  • wandb / TensorBoard integration (MVP: print metrics)
  • vLLM inference (use HF generate for simplicity)
  • Hard-difficulty questions in training set (add later)
  • WebSocket-based training (use local env)
  • Multi-GPU / distributed training
  • Custom RLHF algorithms beyond GRPO

0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in small, mergeable increments.

Scope Budget

  • Target: 3 slices
  • Hard max: <= 10 steps total
  • Each step must end in: implement -> verify -> merge

Slice Definition

Slice Name Value
S1 Training Config + Prompts Configurable training setup, system prompt for SQL agent
S2 Rollout + Rewards TRL-compatible rollout function and reward callables
S3 Training Notebook End-to-end notebook with learning curve and comparison

Status Icons

Step Status:

  • !! Not Started
  • In Progress

  • OK Completed
  • XX Blocked/Failed

Result Outcome:

  • OK Fully Successful (all tests passed, no issues)
  • !! Completed with Issues (needs follow-up)
  • XX Failed/Blocked

1. Implementation Overview

Summary

Add a training/ subpackage with configuration, rollout, reward wrappers, and prompt modules that integrate with TRL's GRPOTrainer. Provide a notebooks/train_grpo.ipynb notebook as the user-facing entry point that trains a small LLM (default: Qwen3-1.7B) to play SQLEnv, then produces learning curves and before/after episode comparisons.

Scope

In Scope:

  • training/config.py -- dataclass with all hyperparameters and model name
  • training/prompts.py -- system prompt for SQL exploration agent
  • training/rollout.py -- rollout_func that plays SQLEnv episodes via HF generate
  • training/rewards.py -- reward callables matching TRL reward_funcs signature
  • notebooks/train_grpo.ipynb -- end-to-end training notebook
  • training/__init__.py -- public exports

Out of Scope:

  • vLLM inference backend
  • wandb/TensorBoard logging
  • Training on hard-difficulty questions
  • Distributed or multi-GPU training

1a. Execution Status

Progress: 6/6 steps complete Current Step: None (implementation complete) Last Updated: 2026-03-28T07:37:20Z Latest Result: OK Fully Successful - Step 3.1 complete, 68/68 tests passed Blockers: None


1b. Risk Assessment

Risk Tier: Medium

Risk Tier Definitions:

  • Low: Pure logic, non-user-facing, no security implications
  • Medium: User input handling, data validation, API changes
  • High: Authentication, payments, secrets management, untrusted input

High-Risk Indicators Present: None

Security Review Required: No

Justification: External model loading from HuggingFace Hub and GPU resource management require care, but no security-sensitive data flows. Risk is primarily around convergence and resource requirements.


2. Change Manifest

Files to Create

File Purpose
training/__init__.py Package init, public exports
training/config.py GRPOConfig dataclass with hyperparameters
training/prompts.py System prompt for SQL exploration agent
training/rollout.py rollout_func for TRL GRPOTrainer
training/rewards.py Reward callables: correctness, progress, operational
training/data_loading.py Model/question loading helpers for notebook runtime and tests
training/notebook_pipeline.py Notebook orchestration helpers for trainer setup, baseline, and metrics
notebooks/train_grpo.ipynb End-to-end training notebook
tests/integration/test_training_pipeline.py Integration verification for rollout + rewards pipeline
tests/e2e/test_training_e2e.py Notebook smoke verification and pipeline behavior checks
tests/unit/test_error_handling.py Error-path verification for model/questions loading and fallback logging

Files to Modify

File Changes
pyproject.toml Add trl and training optional dependency group

Files to Delete

None.


3. Interface Specifications

New Types

# Location: training/config.py

from dataclasses import dataclass, field

@dataclass
class GRPOConfig:
    """All hyperparameters for GRPO training on SQLEnv."""

    # Model
    model_name: str = "Qwen/Qwen3-1.7B"
    max_new_tokens: int = 256

    # Training
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-6
    num_generations: int = 4          # G in GRPO (completions per prompt)

    # Environment
    questions_path: str = "data/questions/questions_train.json"
    db_dir: str = "data/databases"
    step_budget: int = 10             # Shorter budget for training
    difficulty_filter: list[str] = field(default_factory=lambda: ["easy", "medium"])

    # Reproducibility
    seed: int = 42

    # Output
    output_dir: str = "outputs/grpo_run"
    logging_steps: int = 10

New Functions

# Location: training/prompts.py

def get_system_prompt() -> str:
    """Return the system prompt for the SQL exploration agent.

    Returns:
        System prompt string instructing the model on SQLEnv action format.
    """


def format_observation(obs: "SQLObservation") -> str:
    """Format an SQLObservation into a user-turn string for the model.

    Args:
        obs: The observation from the environment.

    Returns:
        Formatted string suitable as a user message in chat history.
    """
# Location: training/rollout.py

from typing import Any

def rollout_func(
    prompts: list[str],
    model: Any,
    tokenizer: Any,
    config: "GRPOConfig",
) -> list[dict[str, Any]]:
    """Play SQLEnv episodes for a batch of question prompts.

    Each prompt is a question text. The function:
    1. Creates a local SQLEnvironment
    2. Resets with the question
    3. Loops: model.generate() -> parse action -> env.step()
    4. Collects completions and metadata

    Args:
        prompts: List of question texts (from training dataset).
        model: HuggingFace model for generation.
        tokenizer: HuggingFace tokenizer.
        config: Training configuration.

    Returns:
        List of dicts with keys:
          - "prompt": str (the input prompt)
          - "completion": str (full model output trajectory)
          - "metadata": dict with episode_id, steps, done, answer_correct
    """
# Location: training/rewards.py

def reward_correctness(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Binary reward: 1.0 if episode ended with correct answer, 0.0 otherwise.

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout (includes 'metadata' key).

    Returns:
        List of float rewards, one per completion.
    """


def reward_progress(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Progress reward: cumulative progress score from environment.

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout.

    Returns:
        List of float rewards, one per completion.
    """


def reward_operational(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Operational reward: sum of per-step L1 signals (exec_ok, new_info, etc.).

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout.

    Returns:
        List of float rewards, one per completion.
    """

4. Data Flow

Primary Flow (Training Loop)

1. Notebook loads GRPOConfig and model/tokenizer from HuggingFace
   - Input: config.model_name
   - Output: model, tokenizer, config

2. Load training questions filtered by difficulty
   - Input: config.questions_path, config.difficulty_filter
   - Output: list[str] of question texts as prompts

3. GRPOTrainer calls rollout_func for each batch of prompts
   - Input: prompts, model, tokenizer, config
   - Action: For each prompt, play a full SQLEnv episode
     a. Create local SQLEnvironment
     b. env.reset(question) -> initial observation
     c. Loop: format obs -> model.generate() -> parse SQLAction -> env.step()
     d. Collect full trajectory as completion string
   - Output: completions + metadata (correctness, progress, operational signals)

4. GRPOTrainer calls each reward_func on completions
   - Input: completions list, metadata kwargs
   - Output: list[float] per reward function

5. GRPOTrainer computes GRPO loss and updates model weights
   - Input: completions, rewards, model
   - Output: updated model weights, logged metrics

6. Repeat steps 3-5 for num_train_epochs

Alternative Flow: Unparseable Model Output

1. Model generates text that cannot be parsed as SQLAction
2. rollout_func defaults to QUERY action with raw text as argument
3. Environment returns an error observation
4. Episode continues (agent can recover in subsequent steps)

Alternative Flow: Episode Exceeds Token Budget

1. Observation context grows beyond max_new_tokens window
2. rollout_func truncates conversation history, keeping:
   a. System prompt (always)
   b. Most recent 3 observation-action pairs
3. Episode continues with truncated context

5. Error Handling

Error Types

Error When Strategy
ModelLoadError Model not found on HuggingFace Fail fast with clear message naming model_name
ActionParseError Model output not parseable as SQLAction Default to QUERY with raw text, log warning
OOMError GPU out of memory during training Print guidance: reduce batch_size or num_generations
QuestionLoadError Questions file missing or empty Fail fast with path in error message
EnvironmentError SQLEnv database missing Fail fast pointing to data download instructions

Error Handling Strategy

# In rollout_func: graceful degradation
try:
    action = parse_action(model_output)
except ActionParseError:
    action = SQLAction(action_type="QUERY", argument=model_output)

# In notebook: fail-fast on setup
try:
    model = AutoModelForCausalLM.from_pretrained(config.model_name)
except Exception as e:
    raise RuntimeError(f"Cannot load model '{config.model_name}': {e}")

Retry Strategy

Operation Retry? Strategy
Model download No Fail fast, user must fix network/model name
Episode rollout No Single attempt per episode, errors become low-reward signal
Training step No OOM is fatal for that config, must adjust params

6. Slice Plan (What we will ship, in order)

Slice S1 -- Training Config + Prompts

Value: Centralized, documented configuration and system prompt ready for training integration User-visible change: No (internal infrastructure) Interfaces introduced/changed: GRPOConfig, get_system_prompt(), format_observation() Rollback safety: Additive only -- new files, no existing code changed

Slice S2 -- Rollout + Rewards

Value: TRL-compatible rollout and reward functions that can drive GRPO training User-visible change: No (library code) Interfaces introduced/changed: rollout_func(), reward_correctness(), reward_progress(), reward_operational() Rollback safety: Additive only -- new files in training/ package

Slice S3 -- Training Notebook

Value: Users can run one notebook to train a model and see before/after results User-visible change: Yes -- the notebook is the primary deliverable Interfaces introduced/changed: notebooks/train_grpo.ipynb, pyproject.toml training deps Rollback safety: Notebook is standalone; pyproject.toml change is additive (optional deps group)


7. Implementation Steps

VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md. The verification-planner (separate agent) generated independent test criteria. Run the tests specified there after implementing each step.

Step 1.1: Training Config Dataclass

Slice: S1 Goal: Create training/config.py with GRPOConfig dataclass holding all hyperparameters.

Files:

  • training/__init__.py - create - package init with public exports
  • training/config.py - create - GRPOConfig dataclass

Interface Changes:

  • New type: GRPOConfig with fields as specified in Section 3

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T06:44:31Z Changes Made:

  • Created training/config.py with GRPOConfig dataclass and input validation in __post_init__
  • Created training/__init__.py exporting GRPOConfig
  • Added tests/unit/test_grpo_config.py covering defaults, overrides, required fields, and validation failures

Result:

  • Outcome: OK Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_grpo_config.py -v
    Result: 7 passed in 17.06s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_grpo_config.py -v
  • Notes:
    • Added explicit validation for numeric bounds and non-empty difficulty filter to fail fast during setup
    • uv run pytest ... failed because pytest is not installed by default; used uv run --with pytest pytest ... for scoped test dependency
    • Kept config required fields (questions_path, db_dir, output_dir) positional/required per verification criteria
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • GRPOConfig available for import by prompts.py and rollout.py

Step 1.2: System Prompt and Observation Formatter

Slice: S1 Goal: Create training/prompts.py with system prompt and observation formatting for model input.

Files:

  • training/prompts.py - create - system prompt and observation formatter

Interface Changes:

  • New functions: get_system_prompt() -> str, format_observation(obs: SQLObservation) -> str

Details:

  • System prompt should instruct the model on:
    • Available actions: DESCRIBE, SAMPLE, QUERY, ANSWER
    • Action format: ACTION_TYPE: argument
    • Exploration strategy guidance (describe tables first, then query, then answer)
    • Budget awareness
  • format_observation converts SQLObservation fields into a readable user-turn string

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T06:47:49Z Changes Made:

  • Created training/prompts.py with deterministic get_system_prompt() and format_observation() helpers
  • Added truncation guard for long observation results to keep prompt payload bounded
  • Updated training/__init__.py exports to include prompt helpers
  • Added tests/unit/test_prompts.py covering prompt content and observation formatting edge cases

Result:

  • Outcome: OK Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_prompts.py -v
    Result: 8 passed in 2.92s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_prompts.py -v
  • Notes:
    • uv run pytest ... failed because pytest is not installed in the base env; used uv run --with pytest pytest ... for scoped dependency execution
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • Prompt module ready for use in rollout.py

Step 2.1: Action Parser Utility

Slice: S2 Goal: Create a robust parser that extracts SQLAction from free-form model output text.

Files:

  • training/rollout.py - create - contains parse_model_output(text: str) -> SQLAction

Interface Changes:

  • New function: parse_model_output(text: str) -> SQLAction
    • Parses ACTION_TYPE: argument format from model text
    • Falls back to SQLAction(action_type="QUERY", argument=text) on parse failure

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T06:51:50Z Changes Made:

  • Created training/rollout.py with parse_model_output(text) and a focused line parser helper
  • Added action parsing for DESCRIBE/SAMPLE/QUERY/ANSWER with case-insensitive matching
  • Added robust fallback behavior to SQLAction(action_type="QUERY", argument=<raw_text>) on parse failure
  • Added tests/unit/test_rollout.py with coverage for happy path, edge cases, multiline output, and fallback behavior

Result:

  • Outcome: OK Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
    Result: 11 passed in 2.44s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_rollout.py -v
  • Notes:
    • uv run pytest ... failed because pytest is not installed in the base env; used uv run --with pytest pytest ... for scoped dependency execution
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • parse_model_output is available in training/rollout.py for Step 2.2 rollout integration

Step 2.2: Rollout Function

Slice: S2 Goal: Implement rollout_func that plays full SQLEnv episodes using HF generate.

Files:

  • training/rollout.py - modify - add rollout_func and play_episode helper

Interface Changes:

  • New function: rollout_func(prompts, model, tokenizer, config) -> list[dict]
  • New helper: play_episode(question_text, model, tokenizer, config, env) -> dict
    • Creates local SQLEnvironment for the episode
    • Loops: format obs -> generate -> parse -> step until done or budget exhausted
    • Returns completion string and metadata dict

Details:

  • Use model.generate() (HF native, not vLLM) for inference
  • Build chat messages using tokenizer.apply_chat_template
  • Truncate conversation history if it exceeds token window (keep system prompt + last 3 turns)
  • Metadata includes: episode_id, step_count, done, answer_correct, cumulative_progress, operational_signals

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Medium

Core integration point between model and environment -- most likely source of bugs.

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T07:04:59Z Changes Made:

  • Expanded training/rollout.py with rollout_func, play_episode, message-history truncation, prompt-aware environment reset, and HF model.generate() integration paths for both list and tensor-like outputs.
  • Added rollout metadata fields (episode_id, step_count, done, answer_correct, cumulative_progress, operational_signals) and top-level compatibility keys (content, correct, progress, operational).
  • Extended tests/unit/test_rollout.py with Step 2.2 coverage for batch behavior, step-budget termination, metadata shape, unparseable-action fallback continuity, history truncation, HF-style generation decoding, prompt binding, and incorrect-answer correctness guard.

Result:

  • Outcome: OK Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
    Result: 21 passed in 2.58s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_rollout.py -v
  • Notes:
    • Used uv run --with pytest ... because pytest is not available in the base environment.
    • Medium-risk reviewer gate executed and resolved to APPROVE after decoder/correctness fixes.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • rollout metadata now carries correctness/progress/operational signals needed by training/rewards.py in Step 2.3

Step 2.3: Reward Functions

Slice: S2 Goal: Implement three TRL-compatible reward callables that consume rollout metadata.

Files:

  • training/rewards.py - create - reward_correctness, reward_progress, reward_operational

Interface Changes:

  • New functions (all with TRL reward_func signature):
    • reward_correctness(completions, **kwargs) -> list[float]
    • reward_progress(completions, **kwargs) -> list[float]
    • reward_operational(completions, **kwargs) -> list[float]

Details:

  • reward_correctness: Binary 1.0/0.0 based on metadata["answer_correct"]
  • reward_progress: Float from metadata["cumulative_progress"], normalized to [0, 1]
  • reward_operational: Sum of per-step operational signals from metadata["operational_signals"]
  • All functions access metadata via kwargs (TRL passes extra data from rollout return)
  • Each function must handle missing metadata gracefully (return 0.0)

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Low

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T07:07:32Z Changes Made:

  • Created training/rewards.py with TRL-compatible reward_correctness, reward_progress, and reward_operational callables
  • Added robust metadata extraction paths so reward functions support both nested metadata payloads and flattened rollout kwargs
  • Updated training/__init__.py exports for reward helper imports from the package root
  • Added tests/unit/test_rewards.py covering correctness/progress/operational behavior across happy path, edge, and batch scenarios

Result:

  • Outcome: OK Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_rewards.py -v
    Result: 19 passed in 3.35s
    
  • Tests run: uv run --with pytest pytest tests/unit/test_rewards.py -v
  • Notes:
    • Used uv run --with pytest ... because pytest is not available in the base environment.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • training/ now exposes config, prompts, rollout parsing/execution, and reward callables; next step is notebook wiring plus optional training dependencies in pyproject.toml

Step 3.1: Training Notebook

Slice: S3 Goal: Create end-to-end training notebook that loads model, trains with GRPO, and produces learning curves.

Files:

  • notebooks/train_grpo.ipynb - create - end-to-end training notebook
  • pyproject.toml - modify - add [project.optional-dependencies] training group

Interface Changes:

  • New optional dependency group: training = ["trl>=0.12.0", "accelerate>=0.34.0"]

Details: Notebook cells (linear flow):

  1. Setup: Install dependencies, import modules, set seed
  2. Config: Instantiate GRPOConfig (users can override model_name here)
  3. Load Model: AutoModelForCausalLM.from_pretrained(config.model_name)
  4. Load Dataset: Load questions, filter by difficulty, format as prompts
  5. Initialize GRPOTrainer: Pass model, tokenizer, rollout_func, reward_funcs, config
  6. Train: trainer.train() with progress bar and metric printing
  7. Learning Curve: Plot reward over training steps (matplotlib)
  8. Comparison: Run 5 episodes with random actions vs trained model, display side-by-side transcripts
  9. Save: Save trained model to config.output_dir

Verification:

See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

Risk Tier for This Step: Medium

User-facing deliverable; must work on fresh setup.

Merge Criteria:

  • Tests from VERIFICATION_SPEC.md pass
  • No TODOs left in changed code (or explicitly tracked)
  • Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T07:37:20Z Changes Made:

  • Created notebooks/train_grpo.ipynb as the primary user-facing training notebook for F006, with one-pass setup, model/question loading, trainer construction, training execution, learning-curve plotting, random-baseline vs trained transcript comparison, and artifact save steps.
  • Added [project.optional-dependencies].training in pyproject.toml with trl>=0.14.0,<0.15.0 and accelerate>=0.34.0 to keep TRL/torch compatibility stable for this repository.
  • Added training/data_loading.py to centralize notebook error handling for model loading and question filtering/loading.
  • Added training/notebook_pipeline.py to centralize trainer wiring, random baseline generation, training execution, and metrics extraction.
  • Updated training/__init__.py exports to include notebook-facing helpers.
  • Added tests/e2e/test_training_e2e.py for notebook smoke structure + pipeline behavior checks.
  • Added tests/integration/test_training_pipeline.py for rollout/reward integration scenarios.
  • Added tests/unit/test_error_handling.py for model/question loading failures, OOM guidance messaging, and parse-fallback warning logging.

Result:

  • Outcome: OK Fully Successful
  • Evidence Captured:
    Command: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v
    Result: 68 passed in 5.79s
    Command: uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"
    Result: ok
    
  • Tests run: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v
  • Notes:
    • Added concrete integration/e2e/error test files that were listed in VERIFICATION_SPEC.md but missing from repository.
    • Notebook now compares random-policy baseline transcripts against trained-policy transcripts, matching the feature's user-facing comparison goal.
    • Parse fallback now emits a warning log to align behavior with error-handling verification expectations.
  • Issues: None
  • Follow-ups Created: None
  • Human Review Completed: N/A

Context for Next Step:

  • All implementation deliverables complete; feature is ready for final verification/finalization bookkeeping.

8. Rollout Considerations

Feature Flags

  • Required: No

Migration

  • Data migration needed: No

Rollback Plan

All changes are additive (new training/ package and notebooks/ directory). Rollback is simply removing those directories and reverting the pyproject.toml optional deps change.


9. Execution Tracking

All execution state is tracked within this document:

  • Section 1a: Overall progress summary
  • Section 7: Per-step completion details, test results, and handoff context
  • FEATURES.json: Feature-level status/progress metadata used by /autocode-next-step and opencode-ctx ralph run
  • Git history: Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:

  • Checking Section 1a for summary
  • Reviewing Section 7 for detailed step status
  • Inspecting the feature's progress and status fields in FEATURES.json
  • Running git log --oneline IMPLEMENTATION_SPEC.md for change history

9a. Slice Completion Protocol

After all steps in a slice pass verification:

  1. Run verifier subagent for spec compliance

    • Validates against VERIFICATION_SPEC.md criteria
    • Ensures no TODOs or incomplete work in slice
  2. Run compound-engineer subagent to extract learnings

    • Mandatory invocation after every slice completion
    • Updates CLAUDE.md Learnings section (if durable patterns found)
    • May exit with "no update needed" (valid for routine work)
  3. Commit the slice changes

    • Follow commit message format in CLAUDE.md
    • Each slice gets its own atomic commit
  4. Continue to next slice (if more slices remain)

    • Or proceed to final verification if all slices complete

Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.


10. User Value Summary

Status: Generated

What Users Can Now Do

Users can now run a single notebook (notebooks/train_grpo.ipynb) to configure GRPO training, load a compatible TRL stack, train a model on SQLEnv prompts, and inspect both reward-curve output and transcript comparisons between random and trained policies.

How to Access/Test

  1. Install training extras: uv sync --extra training
  2. Open notebooks/train_grpo.ipynb
  3. Run all cells to train and save artifacts to outputs/grpo_run

Demo

  • Command: jupyter notebook notebooks/train_grpo.ipynb
  • Verification command: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v

Release Notes Snippet

Add a GRPO training pipeline for SQLEnv with a runnable notebook, pinned TRL training dependencies, robust loading/error helpers, and verification coverage across unit, integration, and notebook-smoke paths.


11. PR Contract (Auto-Generated by autocode-next-step)

Status: Generated

Scope

  • Finalized Step 3.1 (Training Notebook) for F006.
  • Added training optional dependency group in pyproject.toml with TRL pin compatible with repo torch version.
  • Added notebook support helpers for model/question loading and trainer orchestration.
  • Added/expanded verification tests for notebook smoke, pipeline integration, and error handling.

Files Changed

  • pyproject.toml
  • notebooks/train_grpo.ipynb
  • training/__init__.py
  • training/data_loading.py
  • training/notebook_pipeline.py
  • training/rollout.py
  • tests/e2e/test_training_e2e.py
  • tests/integration/test_training_pipeline.py
  • tests/unit/test_error_handling.py
  • specs/F006-IMPLEMENTATION_SPEC.md
  • specs/behavior/training.md

Verification Evidence

  • uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v -> 68 passed
  • uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')" -> ok
  • Verifier verdict: APPROVED (specs/F006-VERIFICATION_REPORT.md)

Risk and Rollback

  • Risk tier: Medium (training dependencies and user-facing notebook workflow).
  • Rollback: remove notebook/training helper additions and revert pyproject.toml training extra.

Ready for Next Command

All implementation and verification criteria for F006 are complete. Run /commit-push-pr when ready.


Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:

  • A step requires touching more than 3 files in unrelated areas
  • You need to introduce multiple new abstractions "just in case"
  • Verification cannot be made targeted and concrete
  • You discover new unknowns that change the plan materially
  • The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.


Human Checkpoint

Before handing to AI agent:

  • Interface specifications are complete
  • Data flow is accurate
  • Error handling is specified
  • Implementation order makes sense
  • VERIFICATION_SPEC.md has been generated

Questions:

  1. Confirm Qwen3-1.7B is accessible on HuggingFace Hub for the target environment.
  2. Verify TRL GRPOTrainer API matches the rollout_func / reward_funcs signatures assumed here.

Handoff Notes

For the implementing AI agent:

Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
  - HF generate (not vLLM) for inference
  - Model name is a config parameter (default Qwen3-1.7B)
  - Start with easy+medium questions only
  - Follow TRL GRPOTrainer Wordle tutorial pattern
  - reward_funcs are separate callables

Specification completed: 2026-03-27 Approved by: [pending] Verification spec: VERIFICATION_SPEC.md Target agent: Claude Code