| # System Behavior: Training |
|
|
| > Living document. Updated by `/archive-spec` when features are completed. |
| > Last archived: F010 on 2026-03-28 |
|
|
| --- |
|
|
| ## Training Pipeline |
|
|
| ### Training notebook produces a trained model from one-click execution |
| <!-- since: F006 | test: tests/training/test_config.py::test_grpo_config_defaults --> |
|
|
| The system provides a `notebooks/train_grpo.ipynb` notebook that, when run end-to-end, downloads a Hugging Face model, trains it on SQLEnv episodes using GRPO, and saves the trained weights to a configurable output directory. |
|
|
| ### Training produces a learning curve showing reward improvement |
| <!-- since: F006 --> |
|
|
| After training completes, the notebook displays a matplotlib plot of reward over training steps, showing whether the model learned to improve its SQL exploration strategy over the course of training. |
|
|
| ### Training produces side-by-side episode transcripts |
| <!-- since: F006 --> |
|
|
| After training completes, the notebook displays episode transcripts comparing random-action baseline episodes against trained-model episodes on the same questions, showing the difference in exploration behavior. |
|
|
| ### Rollout function plays SQLEnv episodes via model generation |
| <!-- since: F006 | test: tests/training/test_rollout.py::test_rollout_func --> |
| |
| The system accepts a batch of question prompts and returns episode completions by playing full SQLEnv episodes: resetting the environment, generating actions with Hugging Face `model.generate()`, parsing them into `SQLAction` objects, and stepping the environment until the episode ends. |
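
The reset/generate/parse/step loop described above can be sketched as follows. This is a minimal illustration, not the project's rollout module: `StepResult`, `play_episode`, `rollout`, the `max_steps` cap, and the `generate`/`parse` callables are all hypothetical names, and the environment is assumed to expose `reset()` and `step(action)` returning an object with `observation` and `done` fields.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    """Hypothetical stand-in for whatever SQLEnv returns from reset()/step()."""
    observation: str
    done: bool


def play_episode(env, generate: Callable[[str], str], parse: Callable[[str], object],
                 prompt: str, max_steps: int = 10) -> list:
    """Play one episode and return the raw model outputs, one per step."""
    result = env.reset()
    context = prompt + "\n" + result.observation
    outputs = []
    for _ in range(max_steps):  # assumed safety cap, independent of the env's own budget
        if result.done:
            break
        text = generate(context)        # in training this would wrap model.generate()
        outputs.append(text)
        result = env.step(parse(text))  # parse turns text into an action object
        context += "\n" + text + "\n" + result.observation
    return outputs


def rollout(prompts, make_env, generate, parse) -> list:
    """Play one fresh episode per prompt; make_env is a no-argument factory."""
    return [play_episode(make_env(), generate, parse, p) for p in prompts]
```

Passing `generate` and `parse` as callables keeps the loop testable without loading a model; a real rollout would also need to return whatever token-level fields the trainer expects.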
| |
| ### Reward functions return per-completion scores for GRPO training |
| <!-- since: F006 | test: tests/training/test_rewards.py::test_reward_correctness --> |
|
|
| The system accepts TRL-format completion batches and returns float reward lists from three independent callables: correctness (binary 0/1), progress (normalized cumulative progress), and operational (sum of per-step L1 signals). |
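
A rough sketch of the three callables, following the TRL convention that a reward function receives `completions` plus keyword arguments and returns one float per completion. How the real functions obtain episode outcomes is not stated here, so the `episodes` keyword argument and its `correct`, `progress`, and `step_signals` keys are assumptions made for illustration, as is clamping progress to [0, 1].

```python
def reward_correctness(completions, episodes, **kwargs) -> list:
    """Binary reward: 1.0 if the episode's final answer was correct, else 0.0."""
    return [1.0 if ep["correct"] else 0.0 for ep in episodes]


def reward_progress(completions, episodes, **kwargs) -> list:
    """Normalized cumulative progress, clamped to [0, 1] (clamping is an assumption)."""
    return [min(max(float(ep["progress"]), 0.0), 1.0) for ep in episodes]


def reward_operational(completions, episodes, **kwargs) -> list:
    """Sum of the per-step signals recorded during the episode."""
    return [float(sum(ep["step_signals"])) for ep in episodes]
```

Keeping the three signals in separate callables lets the trainer weight or log them independently.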
|
|
| ### Unparseable model output falls back to QUERY action |
| <!-- since: F006 | test: tests/training/test_rollout.py::test_parse_model_output_fallback --> |
| |
| When the model produces text that cannot be parsed in the `ACTION_TYPE: argument` format, the system defaults to a `QUERY` action with the raw text as the argument, allowing the episode to continue rather than crashing. |
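
One way the fallback could look, as a minimal sketch. The `SQLAction` dataclass here is a stand-in for the project's action type, and the set of recognized action names and the case-insensitive matching are assumptions.

```python
from dataclasses import dataclass

# Assumed action vocabulary; the real parser may accept a different set.
KNOWN_ACTIONS = {"DESCRIBE", "SAMPLE", "QUERY", "ANSWER"}


@dataclass
class SQLAction:
    """Stand-in for the project's action type."""
    action_type: str
    argument: str


def parse_model_output(text: str) -> SQLAction:
    """Parse `ACTION_TYPE: argument`; fall back to QUERY with the raw text."""
    head, sep, tail = text.partition(":")
    action_type = head.strip().upper()
    if sep and action_type in KNOWN_ACTIONS:
        return SQLAction(action_type, tail.strip())
    # Unparseable output: keep the episode going instead of raising.
    return SQLAction("QUERY", text)
```

Under this sketch, `parse_model_output("SELECT * FROM t")` yields a `QUERY` action carrying the unmodified text, so malformed generations still produce an environment step, and presumably an error observation, rather than aborting the rollout.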
|
|
| ### TRL environment_factory integration |
| <!-- since: F010 | test: tests/unit/test_trl_adapter.py::test_configure_and_instantiate --> |
|
|
| The training system accepts a TRL-compatible environment class (`SQLEnvTRL`) as `environment_factory` for `GRPOTrainer`. TRL auto-discovers `describe`, `sample`, `query`, and `answer` as callable tools via typed docstrings and runs generation/tool-calling/multi-turn control flow without custom rollout glue. |
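
To illustrate what "typed docstrings" might look like on the four tool methods, here is a hedged sketch. The class name `SQLEnvTRLSketch`, every parameter name, and the placeholder return values are hypothetical; `list_tool_candidates` only approximates the kind of discovery described and is not TRL's actual mechanism.

```python
import inspect


class SQLEnvTRLSketch:
    """Illustrative stand-in; not the project's SQLEnvTRL."""

    def reset(self) -> None:
        """Start a fresh episode (not exposed as a tool in this sketch)."""

    def describe(self, table_name: str) -> str:
        """Return the schema of a table.

        Args:
            table_name: Name of the table to describe.
        """
        return f"(schema of {table_name})"

    def sample(self, table_name: str) -> str:
        """Return a few example rows from a table.

        Args:
            table_name: Name of the table to sample.
        """
        return f"(rows from {table_name})"

    def query(self, sql: str) -> str:
        """Run a SQL query and return its result as text.

        Args:
            sql: The SQL statement to execute.
        """
        return f"(result of {sql})"

    def answer(self, value: str) -> str:
        """Submit the final answer and end the episode.

        Args:
            value: The proposed answer to the question.
        """
        return "(episode finished)"


def list_tool_candidates(env_cls) -> list:
    """Public, documented methods other than reset (an approximation only)."""
    return [
        name
        for name, fn in inspect.getmembers(env_cls, inspect.isfunction)
        if not name.startswith("_") and name != "reset" and fn.__doc__
    ]
```

The type annotations and the `Args:` sections are what would let a framework derive a tool schema for each method without hand-written JSON.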
|
|
| ### Class-level environment configuration for no-arg factory construction |
| <!-- since: F010 | test: tests/unit/test_trl_adapter.py::test_configure_sets_class_attrs --> |
|
|
| The adapter accepts environment configuration (`questions_path`, `db_dir`, `step_budget`) through a `configure()` classmethod before trainer construction, satisfying TRL's no-argument `environment_factory` instantiation contract. |
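
The class-level configuration pattern can be sketched like this. `ConfigurableEnv`, the private attribute names, the default step budget, and the error raised when unconfigured are assumptions; only the three configuration keys and the `configure()` classmethod come from the description above.

```python
from typing import ClassVar, Optional


class ConfigurableEnv:
    """Sketch of a no-argument-constructible environment with class-level config."""

    _questions_path: ClassVar[Optional[str]] = None
    _db_dir: ClassVar[Optional[str]] = None
    _step_budget: ClassVar[int] = 10  # default value is an assumption

    @classmethod
    def configure(cls, questions_path: str, db_dir: str, step_budget: int = 10) -> None:
        """Store configuration on the class so that cls() needs no arguments."""
        cls._questions_path = questions_path
        cls._db_dir = db_dir
        cls._step_budget = step_budget

    def __init__(self) -> None:
        if self._questions_path is None or self._db_dir is None:
            raise RuntimeError("call configure() before constructing the environment")
        # Copy the class-level config onto the instance at construction time.
        self.questions_path = self._questions_path
        self.db_dir = self._db_dir
        self.step_budget = self._step_budget
```

Because the trainer instantiates the factory with no arguments, anything the environment needs has to be in place on the class before trainer construction; failing loudly when it is missing is one way to surface a forgotten `configure()` call early.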
|
|
| ### Environment reward accumulation via callback |
| <!-- since: F010 | test: tests/unit/test_trl_adapter.py::test_reward_accumulation --> |
|
|
| Each adapter instance accumulates per-step reward during an episode, and a module-level reward callback reads those values and returns `list[float]` in environment order for TRL reward ingestion. |
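
A minimal sketch of the accumulate-then-read pattern. The class `RewardTrackingEnv`, the `record_step` helper, and the assumption that the trainer passes the environment instances to the callback as an `environments` keyword argument are illustrative and should be checked against the installed TRL version.

```python
class RewardTrackingEnv:
    """Sketch of an adapter instance that sums its per-step rewards."""

    def __init__(self) -> None:
        self.reward = 0.0

    def record_step(self, step_reward: float) -> None:
        """Add one step's reward to the running episode total."""
        self.reward += step_reward


def environment_reward(environments, **kwargs) -> list:
    """Module-level callback: one float per environment, in the order given."""
    return [float(env.reward) for env in environments]
```

Reading the totals in the order the environments arrive keeps each reward aligned with the completion produced by that environment.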
|
|
| ### Episode state isolation across resets and concurrent instances |
| <!-- since: F010 | test: tests/unit/test_trl_adapter.py::test_reset_clears_state --> |
| |
| Each environment instance owns independent mutable episode state, so concurrent instances never share state. Calling `reset()` clears the accumulated reward and the done flag for a fresh episode, preventing cross-episode leakage. |
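
The isolation property can be illustrated with a small sketch. `EpisodeState` and its `history` list are hypothetical; the point is that mutable state is created per instance in `__init__` rather than declared as a class attribute, and that `reset()` rebinds it.

```python
class EpisodeState:
    """Sketch of per-instance episode state with an explicit reset."""

    def __init__(self) -> None:
        # Created per instance; a class-level list would be shared by all instances.
        self.reward = 0.0
        self.done = False
        self.history = []  # hypothetical extra field, for illustration only

    def reset(self) -> None:
        """Clear everything that belongs to the previous episode."""
        self.reward = 0.0
        self.done = False
        self.history = []  # rebind rather than mutate a possibly shared object
```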
| |
| ### build_trainer accepts environment_factory |
| <!-- since: F010 | previously: F006 | test: tests/unit/test_trl_adapter.py::test_build_trainer_environment_factory --> |
| |
| **Before:** `build_trainer` accepted a rollout-function path and passed custom rollout glue into trainer construction. |
| **After:** `build_trainer` accepts `environment_factory` and forwards the environment class directly to `GRPOTrainer`, with optional `configure()` pre-wiring from notebook config values. |
|
|
| The legacy rollout module remains in the repository for compatibility/reference but is no longer the training pipeline's default orchestration path. |
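
The "After" behavior might be sketched as below. The real `build_trainer` signature is not shown in this document, so the `trainer_cls` and `env_config` parameters are assumptions; `trainer_cls` is injected here only so the sketch runs without TRL installed, and in the notebook it would presumably be `GRPOTrainer`.

```python
from typing import Any, Optional


def build_trainer(trainer_cls, environment_factory,
                  env_config: Optional[dict] = None, **trainer_kwargs: Any):
    """Optionally pre-wire the environment class, then hand it to the trainer."""
    if env_config is not None:
        # Class-level configuration must happen before the trainer instantiates envs.
        environment_factory.configure(**env_config)
    # The class itself is forwarded, not an instance.
    return trainer_cls(environment_factory=environment_factory, **trainer_kwargs)
```

Forwarding the class rather than a rollout function is what moves generation and tool-calling control flow out of project code and into the trainer.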
|
|