Spaces:

hjerpe
/

sql_env

Sleeping

App Files Files Community

sql_env / specs /F006-IMPLEMENTATION_SPEC.md

hjerpe

Upload folder using huggingface_hub

5dd1bb4 verified about 2 months ago

preview code

raw

history blame contribute delete

35.3 kB

Implementation Specification

Change: F006 -- GRPO Training Pipeline Date: 2026-03-27 Research Summary: specs/F006-RESEARCH_SUMMARY.md Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner) Behavior Delta: Archived to specs/behavior/training.md

Plan Status:

Draft
Approved for Implementation
Implementation Complete
Verification Passed

Core Intent (Immutable)

User Problem: Train a model that learns SQL exploration strategy through RL. The "before vs after" comparison is the competition's money shot -- untrained agent flails randomly, trained agent explores strategically.

Success Criteria:

Training notebook runs end-to-end in one click
Learning curve clearly shows improvement over episodes
Side-by-side episode transcripts: random vs trained
Reproducible results (deterministic given seed)

Avoid:

Training that does not converge at all (no learning signal)
Requiring an expensive GPU for hours to see any signal
Notebook with hidden dependencies that break on fresh setup

Out of Scope:

wandb / TensorBoard integration (MVP: print metrics)
vLLM inference (use HF generate for simplicity)
Hard-difficulty questions in training set (add later)
WebSocket-based training (use local env)
Multi-GPU / distributed training
Custom RLHF algorithms beyond GRPO

0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in small, mergeable increments.

Scope Budget

Target: 3 slices
Hard max: <= 10 steps total
Each step must end in: implement -> verify -> merge

Slice Definition

Slice	Name	Value
S1	Training Config + Prompts	Configurable training setup, system prompt for SQL agent
S2	Rollout + Rewards	TRL-compatible rollout function and reward callables
S3	Training Notebook	End-to-end notebook with learning curve and comparison

Status Icons

Step Status:

!! Not Started
In Progress
OK Completed
XX Blocked/Failed

Result Outcome:

OK Fully Successful (all tests passed, no issues)
!! Completed with Issues (needs follow-up)
XX Failed/Blocked

1. Implementation Overview

Summary

Add a training/ subpackage with configuration, rollout, reward wrappers, and prompt modules that integrate with TRL's GRPOTrainer. Provide a notebooks/train_grpo.ipynb notebook as the user-facing entry point that trains a small LLM (default: Qwen3-1.7B) to play SQLEnv, then produces learning curves and before/after episode comparisons.

Scope

In Scope:

training/config.py -- dataclass with all hyperparameters and model name
training/prompts.py -- system prompt for SQL exploration agent
training/rollout.py -- rollout_func that plays SQLEnv episodes via HF generate
training/rewards.py -- reward callables matching TRL reward_funcs signature
notebooks/train_grpo.ipynb -- end-to-end training notebook
training/__init__.py -- public exports

Out of Scope:

vLLM inference backend
wandb/TensorBoard logging
Training on hard-difficulty questions
Distributed or multi-GPU training

1a. Execution Status

Progress: 6/6 steps complete Current Step: None (implementation complete) Last Updated: 2026-03-28T07:37:20Z Latest Result: OK Fully Successful - Step 3.1 complete, 68/68 tests passed Blockers: None

1b. Risk Assessment

Risk Tier: Medium

Risk Tier Definitions:

Low: Pure logic, non-user-facing, no security implications
Medium: User input handling, data validation, API changes
High: Authentication, payments, secrets management, untrusted input

High-Risk Indicators Present: None

Security Review Required: No

Justification: External model loading from HuggingFace Hub and GPU resource management require care, but no security-sensitive data flows. Risk is primarily around convergence and resource requirements.

2. Change Manifest

Files to Create

File	Purpose
`training/__init__.py`	Package init, public exports
`training/config.py`	`GRPOConfig` dataclass with hyperparameters
`training/prompts.py`	System prompt for SQL exploration agent
`training/rollout.py`	`rollout_func` for TRL GRPOTrainer
`training/rewards.py`	Reward callables: correctness, progress, operational
`training/data_loading.py`	Model/question loading helpers for notebook runtime and tests
`training/notebook_pipeline.py`	Notebook orchestration helpers for trainer setup, baseline, and metrics
`notebooks/train_grpo.ipynb`	End-to-end training notebook
`tests/integration/test_training_pipeline.py`	Integration verification for rollout + rewards pipeline
`tests/e2e/test_training_e2e.py`	Notebook smoke verification and pipeline behavior checks
`tests/unit/test_error_handling.py`	Error-path verification for model/questions loading and fallback logging

Files to Modify

File	Changes
`pyproject.toml`	Add `trl` and training optional dependency group

Files to Delete

None.

3. Interface Specifications

New Types

# Location: training/config.py

from dataclasses import dataclass, field

@dataclass
class GRPOConfig:
    """All hyperparameters for GRPO training on SQLEnv."""

    # Model
    model_name: str = "Qwen/Qwen3-1.7B"
    max_new_tokens: int = 256

    # Training
    num_train_epochs: int = 1
    per_device_train_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 5e-6
    num_generations: int = 4          # G in GRPO (completions per prompt)

    # Environment
    questions_path: str = "data/questions/questions_train.json"
    db_dir: str = "data/databases"
    step_budget: int = 10             # Shorter budget for training
    difficulty_filter: list[str] = field(default_factory=lambda: ["easy", "medium"])

    # Reproducibility
    seed: int = 42

    # Output
    output_dir: str = "outputs/grpo_run"
    logging_steps: int = 10

New Functions

# Location: training/prompts.py

def get_system_prompt() -> str:
    """Return the system prompt for the SQL exploration agent.

    Returns:
        System prompt string instructing the model on SQLEnv action format.
    """


def format_observation(obs: "SQLObservation") -> str:
    """Format an SQLObservation into a user-turn string for the model.

    Args:
        obs: The observation from the environment.

    Returns:
        Formatted string suitable as a user message in chat history.
    """

# Location: training/rollout.py

from typing import Any

def rollout_func(
    prompts: list[str],
    model: Any,
    tokenizer: Any,
    config: "GRPOConfig",
) -> list[dict[str, Any]]:
    """Play SQLEnv episodes for a batch of question prompts.

    Each prompt is a question text. The function:
    1. Creates a local SQLEnvironment
    2. Resets with the question
    3. Loops: model.generate() -> parse action -> env.step()
    4. Collects completions and metadata

    Args:
        prompts: List of question texts (from training dataset).
        model: HuggingFace model for generation.
        tokenizer: HuggingFace tokenizer.
        config: Training configuration.

    Returns:
        List of dicts with keys:
          - "prompt": str (the input prompt)
          - "completion": str (full model output trajectory)
          - "metadata": dict with episode_id, steps, done, answer_correct
    """

# Location: training/rewards.py

def reward_correctness(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Binary reward: 1.0 if episode ended with correct answer, 0.0 otherwise.

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout (includes 'metadata' key).

    Returns:
        List of float rewards, one per completion.
    """


def reward_progress(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Progress reward: cumulative progress score from environment.

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout.

    Returns:
        List of float rewards, one per completion.
    """


def reward_operational(
    completions: list[list[dict[str, str]]],
    **kwargs: Any,
) -> list[float]:
    """Operational reward: sum of per-step L1 signals (exec_ok, new_info, etc.).

    Args:
        completions: Batch of completion message lists (TRL format).
        **kwargs: Additional metadata from rollout.

    Returns:
        List of float rewards, one per completion.
    """

4. Data Flow

Primary Flow (Training Loop)

1. Notebook loads GRPOConfig and model/tokenizer from HuggingFace
   - Input: config.model_name
   - Output: model, tokenizer, config

2. Load training questions filtered by difficulty
   - Input: config.questions_path, config.difficulty_filter
   - Output: list[str] of question texts as prompts

3. GRPOTrainer calls rollout_func for each batch of prompts
   - Input: prompts, model, tokenizer, config
   - Action: For each prompt, play a full SQLEnv episode
     a. Create local SQLEnvironment
     b. env.reset(question) -> initial observation
     c. Loop: format obs -> model.generate() -> parse SQLAction -> env.step()
     d. Collect full trajectory as completion string
   - Output: completions + metadata (correctness, progress, operational signals)

4. GRPOTrainer calls each reward_func on completions
   - Input: completions list, metadata kwargs
   - Output: list[float] per reward function

5. GRPOTrainer computes GRPO loss and updates model weights
   - Input: completions, rewards, model
   - Output: updated model weights, logged metrics

6. Repeat steps 3-5 for num_train_epochs

Alternative Flow: Unparseable Model Output

1. Model generates text that cannot be parsed as SQLAction
2. rollout_func defaults to QUERY action with raw text as argument
3. Environment returns an error observation
4. Episode continues (agent can recover in subsequent steps)

Alternative Flow: Episode Exceeds Token Budget

1. Observation context grows beyond max_new_tokens window
2. rollout_func truncates conversation history, keeping:
   a. System prompt (always)
   b. Most recent 3 observation-action pairs
3. Episode continues with truncated context

5. Error Handling

Error Types

Error	When	Strategy
`ModelLoadError`	Model not found on HuggingFace	Fail fast with clear message naming model_name
`ActionParseError`	Model output not parseable as SQLAction	Default to QUERY with raw text, log warning
`OOMError`	GPU out of memory during training	Print guidance: reduce batch_size or num_generations
`QuestionLoadError`	Questions file missing or empty	Fail fast with path in error message
`EnvironmentError`	SQLEnv database missing	Fail fast pointing to data download instructions

Error Handling Strategy

# In rollout_func: graceful degradation
try:
    action = parse_action(model_output)
except ActionParseError:
    action = SQLAction(action_type="QUERY", argument=model_output)

# In notebook: fail-fast on setup
try:
    model = AutoModelForCausalLM.from_pretrained(config.model_name)
except Exception as e:
    raise RuntimeError(f"Cannot load model '{config.model_name}': {e}")

Retry Strategy

Operation	Retry?	Strategy
Model download	No	Fail fast, user must fix network/model name
Episode rollout	No	Single attempt per episode, errors become low-reward signal
Training step	No	OOM is fatal for that config, must adjust params

6. Slice Plan (What we will ship, in order)

Slice S1 -- Training Config + Prompts

Value: Centralized, documented configuration and system prompt ready for training integration User-visible change: No (internal infrastructure) Interfaces introduced/changed: GRPOConfig, get_system_prompt(), format_observation() Rollback safety: Additive only -- new files, no existing code changed

Slice S2 -- Rollout + Rewards

Value: TRL-compatible rollout and reward functions that can drive GRPO training User-visible change: No (library code) Interfaces introduced/changed: rollout_func(), reward_correctness(), reward_progress(), reward_operational() Rollback safety: Additive only -- new files in training/ package

Slice S3 -- Training Notebook

Value: Users can run one notebook to train a model and see before/after results User-visible change: Yes -- the notebook is the primary deliverable Interfaces introduced/changed: notebooks/train_grpo.ipynb, pyproject.toml training deps Rollback safety: Notebook is standalone; pyproject.toml change is additive (optional deps group)

7. Implementation Steps

Step 1.1: Training Config Dataclass

Slice: S1 Goal: Create training/config.py with GRPOConfig dataclass holding all hyperparameters.

Files:

training/__init__.py - create - package init with public exports
training/config.py - create - GRPOConfig dataclass

Interface Changes:

New type: GRPOConfig with fields as specified in Section 3

Verification:

Risk Tier for This Step: Low

Merge Criteria:

Tests from VERIFICATION_SPEC.md pass
No TODOs left in changed code (or explicitly tracked)
Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T06:44:31Z Changes Made:

Created training/config.py with GRPOConfig dataclass and input validation in __post_init__
Created training/__init__.py exporting GRPOConfig
Added tests/unit/test_grpo_config.py covering defaults, overrides, required fields, and validation failures

Result:

Outcome: OK Fully Successful

Evidence Captured:

Command: uv run --with pytest pytest tests/unit/test_grpo_config.py -v
Result: 7 passed in 17.06s

Tests run: uv run --with pytest pytest tests/unit/test_grpo_config.py -v
Notes:
- Added explicit validation for numeric bounds and non-empty difficulty filter to fail fast during setup
- uv run pytest ... failed because pytest is not installed by default; used uv run --with pytest pytest ... for scoped test dependency
- Kept config required fields (questions_path, db_dir, output_dir) positional/required per verification criteria
Issues: None
Follow-ups Created: None
Human Review Completed: N/A

Context for Next Step:

GRPOConfig available for import by prompts.py and rollout.py

Step 1.2: System Prompt and Observation Formatter

Slice: S1 Goal: Create training/prompts.py with system prompt and observation formatting for model input.

Files:

training/prompts.py - create - system prompt and observation formatter

Interface Changes:

New functions: get_system_prompt() -> str, format_observation(obs: SQLObservation) -> str

Details:

System prompt should instruct the model on:
- Available actions: DESCRIBE, SAMPLE, QUERY, ANSWER
- Action format: ACTION_TYPE: argument
- Exploration strategy guidance (describe tables first, then query, then answer)
- Budget awareness
format_observation converts SQLObservation fields into a readable user-turn string

Verification:

Risk Tier for This Step: Low

Merge Criteria:

Tests from VERIFICATION_SPEC.md pass
No TODOs left in changed code (or explicitly tracked)
Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T06:47:49Z Changes Made:

Created training/prompts.py with deterministic get_system_prompt() and format_observation() helpers
Added truncation guard for long observation results to keep prompt payload bounded
Updated training/__init__.py exports to include prompt helpers
Added tests/unit/test_prompts.py covering prompt content and observation formatting edge cases

Result:

Outcome: OK Fully Successful

Evidence Captured:

Command: uv run --with pytest pytest tests/unit/test_prompts.py -v
Result: 8 passed in 2.92s

Tests run: uv run --with pytest pytest tests/unit/test_prompts.py -v
Notes:
- uv run pytest ... failed because pytest is not installed in the base env; used uv run --with pytest pytest ... for scoped dependency execution
Issues: None
Follow-ups Created: None
Human Review Completed: N/A

Context for Next Step:

Prompt module ready for use in rollout.py

Step 2.1: Action Parser Utility

Slice: S2 Goal: Create a robust parser that extracts SQLAction from free-form model output text.

Files:

training/rollout.py - create - contains parse_model_output(text: str) -> SQLAction

Interface Changes:

New function: parse_model_output(text: str) -> SQLAction
- Parses ACTION_TYPE: argument format from model text
- Falls back to SQLAction(action_type="QUERY", argument=text) on parse failure

Verification:

Risk Tier for This Step: Low

Merge Criteria:

Tests from VERIFICATION_SPEC.md pass
No TODOs left in changed code (or explicitly tracked)
Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T06:51:50Z Changes Made:

Created training/rollout.py with parse_model_output(text) and a focused line parser helper
Added action parsing for DESCRIBE/SAMPLE/QUERY/ANSWER with case-insensitive matching
Added robust fallback behavior to SQLAction(action_type="QUERY", argument=<raw_text>) on parse failure
Added tests/unit/test_rollout.py with coverage for happy path, edge cases, multiline output, and fallback behavior

Result:

Outcome: OK Fully Successful

Evidence Captured:

Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
Result: 11 passed in 2.44s

Tests run: uv run --with pytest pytest tests/unit/test_rollout.py -v
Notes:
- uv run pytest ... failed because pytest is not installed in the base env; used uv run --with pytest pytest ... for scoped dependency execution
Issues: None
Follow-ups Created: None
Human Review Completed: N/A

Context for Next Step:

parse_model_output is available in training/rollout.py for Step 2.2 rollout integration

Step 2.2: Rollout Function

Slice: S2 Goal: Implement rollout_func that plays full SQLEnv episodes using HF generate.

Files:

training/rollout.py - modify - add rollout_func and play_episode helper

Interface Changes:

New function: rollout_func(prompts, model, tokenizer, config) -> list[dict]
New helper: play_episode(question_text, model, tokenizer, config, env) -> dict
- Creates local SQLEnvironment for the episode
- Loops: format obs -> generate -> parse -> step until done or budget exhausted
- Returns completion string and metadata dict

Details:

Use model.generate() (HF native, not vLLM) for inference
Build chat messages using tokenizer.apply_chat_template
Truncate conversation history if it exceeds token window (keep system prompt + last 3 turns)
Metadata includes: episode_id, step_count, done, answer_correct, cumulative_progress, operational_signals

Verification:

Risk Tier for This Step: Medium

Merge Criteria:

Tests from VERIFICATION_SPEC.md pass
No TODOs left in changed code (or explicitly tracked)
Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T07:04:59Z Changes Made:

Expanded training/rollout.py with rollout_func, play_episode, message-history truncation, prompt-aware environment reset, and HF model.generate() integration paths for both list and tensor-like outputs.
Added rollout metadata fields (episode_id, step_count, done, answer_correct, cumulative_progress, operational_signals) and top-level compatibility keys (content, correct, progress, operational).
Extended tests/unit/test_rollout.py with Step 2.2 coverage for batch behavior, step-budget termination, metadata shape, unparseable-action fallback continuity, history truncation, HF-style generation decoding, prompt binding, and incorrect-answer correctness guard.

Result:

Outcome: OK Fully Successful

Evidence Captured:

Command: uv run --with pytest pytest tests/unit/test_rollout.py -v
Result: 21 passed in 2.58s

Tests run: uv run --with pytest pytest tests/unit/test_rollout.py -v
Notes:
- Used uv run --with pytest ... because pytest is not available in the base environment.
- Medium-risk reviewer gate executed and resolved to APPROVE after decoder/correctness fixes.
Issues: None
Follow-ups Created: None
Human Review Completed: N/A

Context for Next Step:

rollout metadata now carries correctness/progress/operational signals needed by training/rewards.py in Step 2.3

Step 2.3: Reward Functions

Slice: S2 Goal: Implement three TRL-compatible reward callables that consume rollout metadata.

Files:

training/rewards.py - create - reward_correctness, reward_progress, reward_operational

Interface Changes:

New functions (all with TRL reward_func signature):
- reward_correctness(completions, **kwargs) -> list[float]
- reward_progress(completions, **kwargs) -> list[float]
- reward_operational(completions, **kwargs) -> list[float]

Details:

reward_correctness: Binary 1.0/0.0 based on metadata["answer_correct"]
reward_progress: Float from metadata["cumulative_progress"], normalized to [0, 1]
reward_operational: Sum of per-step operational signals from metadata["operational_signals"]
All functions access metadata via kwargs (TRL passes extra data from rollout return)
Each function must handle missing metadata gracefully (return 0.0)

Verification:

Risk Tier for This Step: Low

Merge Criteria:

Tests from VERIFICATION_SPEC.md pass
No TODOs left in changed code (or explicitly tracked)
Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T07:07:32Z Changes Made:

Created training/rewards.py with TRL-compatible reward_correctness, reward_progress, and reward_operational callables
Added robust metadata extraction paths so reward functions support both nested metadata payloads and flattened rollout kwargs
Updated training/__init__.py exports for reward helper imports from the package root
Added tests/unit/test_rewards.py covering correctness/progress/operational behavior across happy path, edge, and batch scenarios

Result:

Outcome: OK Fully Successful

Evidence Captured:

Command: uv run --with pytest pytest tests/unit/test_rewards.py -v
Result: 19 passed in 3.35s

Tests run: uv run --with pytest pytest tests/unit/test_rewards.py -v
Notes:
- Used uv run --with pytest ... because pytest is not available in the base environment.
Issues: None
Follow-ups Created: None
Human Review Completed: N/A

Context for Next Step:

training/ now exposes config, prompts, rollout parsing/execution, and reward callables; next step is notebook wiring plus optional training dependencies in pyproject.toml

Step 3.1: Training Notebook

Slice: S3 Goal: Create end-to-end training notebook that loads model, trains with GRPO, and produces learning curves.

Files:

notebooks/train_grpo.ipynb - create - end-to-end training notebook
pyproject.toml - modify - add [project.optional-dependencies] training group

Interface Changes:

New optional dependency group: training = ["trl>=0.12.0", "accelerate>=0.34.0"]

Details: Notebook cells (linear flow):

Setup: Install dependencies, import modules, set seed
Config: Instantiate GRPOConfig (users can override model_name here)
Load Model: AutoModelForCausalLM.from_pretrained(config.model_name)
Load Dataset: Load questions, filter by difficulty, format as prompts
Initialize GRPOTrainer: Pass model, tokenizer, rollout_func, reward_funcs, config
Train: trainer.train() with progress bar and metric printing
Learning Curve: Plot reward over training steps (matplotlib)
Comparison: Run 5 episodes with random actions vs trained model, display side-by-side transcripts
Save: Save trained model to config.output_dir

Verification:

Risk Tier for This Step: Medium

Merge Criteria:

Tests from VERIFICATION_SPEC.md pass
No TODOs left in changed code (or explicitly tracked)
Backwards compatible (or flag/migration documented)

Status: OK Completed

Completed: 2026-03-28T07:37:20Z Changes Made:

Created notebooks/train_grpo.ipynb as the primary user-facing training notebook for F006, with one-pass setup, model/question loading, trainer construction, training execution, learning-curve plotting, random-baseline vs trained transcript comparison, and artifact save steps.
Added [project.optional-dependencies].training in pyproject.toml with trl>=0.14.0,<0.15.0 and accelerate>=0.34.0 to keep TRL/torch compatibility stable for this repository.
Added training/data_loading.py to centralize notebook error handling for model loading and question filtering/loading.
Added training/notebook_pipeline.py to centralize trainer wiring, random baseline generation, training execution, and metrics extraction.
Updated training/__init__.py exports to include notebook-facing helpers.
Added tests/e2e/test_training_e2e.py for notebook smoke structure + pipeline behavior checks.
Added tests/integration/test_training_pipeline.py for rollout/reward integration scenarios.
Added tests/unit/test_error_handling.py for model/question loading failures, OOM guidance messaging, and parse-fallback warning logging.

Result:

Outcome: OK Fully Successful

Evidence Captured:

Command: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v
Result: 68 passed in 5.79s
Command: uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')"
Result: ok

Tests run: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v
Notes:
- Added concrete integration/e2e/error test files that were listed in VERIFICATION_SPEC.md but missing from repository.
- Notebook now compares random-policy baseline transcripts against trained-policy transcripts, matching the feature's user-facing comparison goal.
- Parse fallback now emits a warning log to align behavior with error-handling verification expectations.
Issues: None
Follow-ups Created: None
Human Review Completed: N/A

Context for Next Step:

All implementation deliverables complete; feature is ready for final verification/finalization bookkeeping.

8. Rollout Considerations

Feature Flags

Required: No

Migration

Data migration needed: No

Rollback Plan

All changes are additive (new training/ package and notebooks/ directory). Rollback is simply removing those directories and reverting the pyproject.toml optional deps change.

9. Execution Tracking

All execution state is tracked within this document:

Section 1a: Overall progress summary
Section 7: Per-step completion details, test results, and handoff context
FEATURES.json: Feature-level status/progress metadata used by /autocode-next-step and opencode-ctx ralph run
Git history: Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching FEATURES.json entry in sync during implementation/finalization. Humans can monitor progress by:

Checking Section 1a for summary
Reviewing Section 7 for detailed step status
Inspecting the feature's progress and status fields in FEATURES.json
Running git log --oneline IMPLEMENTATION_SPEC.md for change history

9a. Slice Completion Protocol

After all steps in a slice pass verification:

Run verifier subagent for spec compliance
- Validates against VERIFICATION_SPEC.md criteria
- Ensures no TODOs or incomplete work in slice
Run compound-engineer subagent to extract learnings
- Mandatory invocation after every slice completion
- Updates CLAUDE.md Learnings section (if durable patterns found)
- May exit with "no update needed" (valid for routine work)
Commit the slice changes
- Follow commit message format in CLAUDE.md
- Each slice gets its own atomic commit
Continue to next slice (if more slices remain)
- Or proceed to final verification if all slices complete

Note: PR creation happens only after ALL slices are complete. Use /commit-push-pr manually when ready.

10. User Value Summary

Status: Generated

What Users Can Now Do

Users can now run a single notebook (notebooks/train_grpo.ipynb) to configure GRPO training, load a compatible TRL stack, train a model on SQLEnv prompts, and inspect both reward-curve output and transcript comparisons between random and trained policies.

How to Access/Test

Install training extras: uv sync --extra training
Open notebooks/train_grpo.ipynb
Run all cells to train and save artifacts to outputs/grpo_run

Demo

Command: jupyter notebook notebooks/train_grpo.ipynb
Verification command: uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v

Release Notes Snippet

Add a GRPO training pipeline for SQLEnv with a runnable notebook, pinned TRL training dependencies, robust loading/error helpers, and verification coverage across unit, integration, and notebook-smoke paths.

11. PR Contract (Auto-Generated by autocode-next-step)

Status: Generated

Scope

Finalized Step 3.1 (Training Notebook) for F006.
Added training optional dependency group in pyproject.toml with TRL pin compatible with repo torch version.
Added notebook support helpers for model/question loading and trainer orchestration.
Added/expanded verification tests for notebook smoke, pipeline integration, and error handling.

Files Changed

pyproject.toml
notebooks/train_grpo.ipynb
training/__init__.py
training/data_loading.py
training/notebook_pipeline.py
training/rollout.py
tests/e2e/test_training_e2e.py
tests/integration/test_training_pipeline.py
tests/unit/test_error_handling.py
specs/F006-IMPLEMENTATION_SPEC.md
specs/behavior/training.md

Verification Evidence

uv run --with pytest pytest tests/unit/test_grpo_config.py tests/unit/test_prompts.py tests/unit/test_rollout.py tests/unit/test_rewards.py tests/unit/test_error_handling.py tests/integration/test_training_pipeline.py tests/e2e/test_training_e2e.py -v -> 68 passed
uv run --extra training python -c "from trl import GRPOConfig, GRPOTrainer; print('ok')" -> ok
Verifier verdict: APPROVED (specs/F006-VERIFICATION_REPORT.md)

Risk and Rollback

Risk tier: Medium (training dependencies and user-facing notebook workflow).
Rollback: remove notebook/training helper additions and revert pyproject.toml training extra.

Ready for Next Command

All implementation and verification criteria for F006 are complete. Run /commit-push-pr when ready.

Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:

A step requires touching more than 3 files in unrelated areas
You need to introduce multiple new abstractions "just in case"
Verification cannot be made targeted and concrete
You discover new unknowns that change the plan materially
The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.

Human Checkpoint

Before handing to AI agent:

Interface specifications are complete
Data flow is accurate
Error handling is specified
Implementation order makes sense
VERIFICATION_SPEC.md has been generated

Questions:

Confirm Qwen3-1.7B is accessible on HuggingFace Hub for the target environment.
Verify TRL GRPOTrainer API matches the rollout_func / reward_funcs signatures assumed here.

Handoff Notes

For the implementing AI agent:

Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions:
  - HF generate (not vLLM) for inference
  - Model name is a config parameter (default Qwen3-1.7B)
  - Start with easy+medium questions only
  - Follow TRL GRPOTrainer Wordle tutorial pattern
  - reward_funcs are separate callables

Specification completed: 2026-03-27 Approved by: [pending] Verification spec: VERIFICATION_SPEC.md Target agent: Claude Code