System Behavior: Training
Living document. Updated by
/archive-spec when features are completed. Last archived: F010 on 2026-03-28
Training Pipeline
Training notebook produces a trained model via one-click execution
The system provides a notebooks/train_grpo.ipynb notebook that, when run end-to-end, downloads a HuggingFace model, trains it on SQLEnv episodes using GRPO, and saves the trained weights to a configurable output directory.
Training produces a learning curve showing reward improvement
After training completes, the notebook displays a matplotlib plot of reward over training steps, showing whether the model learned to improve its SQL exploration strategy over the course of training.
Training produces side-by-side episode transcripts
After training completes, the notebook displays episode transcripts comparing random-action baseline episodes against trained-model episodes on the same questions, showing the difference in exploration behavior.
Rollout function plays SQLEnv episodes via model generation
The system accepts a batch of question prompts and returns episode completions by playing full SQLEnv episodes: resetting the environment, generating actions with HF model.generate(), parsing them into SQLActions, and stepping the environment until the episode ends.
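The rollout loop described above can be sketched as follows. This is a minimal, self-contained sketch: `StubSQLEnv` and the `generate` callable are hypothetical stand-ins for the real SQLEnv and HF `model.generate()` (action parsing is covered separately by the fallback behavior below).

```python
from dataclasses import dataclass

@dataclass
class SQLAction:
    kind: str
    argument: str

class StubSQLEnv:
    """Toy stand-in for SQLEnv: episode ends after `budget` steps."""
    def __init__(self, budget=3):
        self.budget = budget
    def reset(self, question):
        self.steps = 0
        return f"Question: {question}"
    def step(self, action):
        self.steps += 1
        done = self.steps >= self.budget
        return f"result of {action.kind}", 0.1, done

def rollout(prompts, generate, env):
    """Play one full episode per prompt; return per-episode transcripts."""
    completions = []
    for prompt in prompts:
        obs = env.reset(prompt)
        transcript, done = [], False
        while not done:
            text = generate(obs)               # model.generate() in the real pipeline
            action = SQLAction("QUERY", text)  # real code parses ACTION_TYPE: argument
            obs, reward, done = env.step(action)
            transcript.append((text, obs, reward))
        completions.append(transcript)
    return completions
```

In the real pipeline, `generate` would wrap tokenization, batched `model.generate()`, and decoding; the stub keeps the control flow visible.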
Reward functions return per-completion scores for GRPO training
The system accepts TRL-format completion batches and returns float reward lists from three independent callables: correctness (binary 0/1), progress (normalized cumulative progress), and operational (sum of per-step L1 signals).
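The three reward callables might look like the sketch below. The keyword arguments (`answers_correct`, `progress_scores`, `max_progress`, `step_signals`) are hypothetical names for per-episode metadata the real pipeline would supply; the contract that matters is the TRL one: each callable takes a completion batch and returns `list[float]`.

```python
def correctness_reward(completions, answers_correct, **kwargs):
    """Binary 0/1 per completion."""
    return [1.0 if ok else 0.0 for ok in answers_correct]

def progress_reward(completions, progress_scores, max_progress, **kwargs):
    """Cumulative progress, normalized to [0, 1]."""
    return [min(p / max_progress, 1.0) for p in progress_scores]

def operational_reward(completions, step_signals, **kwargs):
    """Sum of per-step L1 signals for each episode."""
    return [float(sum(signals)) for signals in step_signals]
```

Keeping the three signals as independent callables lets GRPO weight and log them separately rather than collapsing them into one opaque scalar up front.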
Unparseable model output falls back to QUERY action
When the model produces text that cannot be parsed as ACTION_TYPE: argument format, the system defaults to a QUERY action with the raw text as the argument, allowing the episode to continue rather than crashing.
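A minimal parser with this fallback could look like the following. The action-type vocabulary (DESCRIBE, SAMPLE, QUERY, ANSWER) is assumed from the tool names mentioned elsewhere in this document.

```python
import re

def parse_action(text):
    """Parse 'ACTION_TYPE: argument'; fall back to QUERY on anything else."""
    match = re.match(r"^\s*(DESCRIBE|SAMPLE|QUERY|ANSWER)\s*:\s*(.*)$",
                     text, re.DOTALL)
    if match:
        return match.group(1), match.group(2).strip()
    # Unparseable output: treat the raw text as a QUERY argument so the
    # episode continues instead of crashing.
    return "QUERY", text.strip()
```

The fallback trades a likely-failing SQL query for a guaranteed crash, which keeps the episode (and its reward signal) alive for training.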
TRL environment_factory integration
The training system accepts a TRL-compatible environment class (SQLEnvTRL) as environment_factory for GRPOTrainer. TRL auto-discovers describe, sample, query, and answer as callable tools via typed docstrings and runs generation/tool-calling/multi-turn control flow without custom rollout glue.
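The shape of the adapter class might be as follows. The method bodies here are stubs (the real versions would query the SQLite databases); the point of the sketch is the surface TRL discovers: plainly named public methods with typed signatures and docstrings.

```python
class SQLEnvTRL:
    """TRL-facing adapter; public methods double as auto-discovered tools."""

    def describe(self, table: str) -> str:
        """Return the schema of `table`."""
        return f"schema of {table}"      # stub; real version inspects the DB

    def sample(self, table: str) -> str:
        """Return a few example rows from `table`."""
        return f"rows from {table}"

    def query(self, sql: str) -> str:
        """Execute a read-only SQL query and return the result."""
        return f"result of {sql}"

    def answer(self, final: str) -> str:
        """Submit the final answer and end the episode."""
        return f"submitted {final}"
```

Because TRL drives generation, tool calling, and multi-turn control flow itself, nothing like the custom rollout loop is needed once the tools are exposed this way.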
Class-level environment configuration for no-arg factory construction
The adapter accepts environment configuration (questions_path, db_dir, step_budget) through a configure() classmethod before trainer construction, satisfying TRL's no-argument environment_factory instantiation contract.
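A sketch of the class-level configuration pattern, assuming the real `SQLEnvTRL` follows roughly this shape:

```python
class SQLEnvTRL:
    _config = None  # class-level config, set before trainer construction

    @classmethod
    def configure(cls, questions_path, db_dir, step_budget):
        cls._config = {"questions_path": questions_path,
                       "db_dir": db_dir,
                       "step_budget": step_budget}

    def __init__(self):
        # No-arg construction, as TRL's environment_factory contract requires.
        if type(self)._config is None:
            raise RuntimeError("call SQLEnvTRL.configure(...) first")
        self.questions_path = self._config["questions_path"]
        self.db_dir = self._config["db_dir"]
        self.step_budget = self._config["step_budget"]
```

The class object itself then serves as the no-argument factory: TRL can call `SQLEnvTRL()` repeatedly, and every instance picks up the configuration set once up front.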
Environment reward accumulation via callback
Each adapter instance accumulates per-step reward during an episode, and a module-level reward callback reads those values and returns list[float] in environment order for TRL reward ingestion.
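The accumulation-plus-callback pattern can be sketched like this. `EpisodeEnv`, `record_step`, and the module-level `_instances` registry are hypothetical names; the mechanism is what the spec describes: instances accumulate reward, a module-level callable reads them back in environment order.

```python
_instances = []  # module-level registry, in environment-creation order

class EpisodeEnv:
    """Minimal adapter that accumulates per-step reward for one episode."""
    def __init__(self):
        self.episode_reward = 0.0
        _instances.append(self)

    def record_step(self, reward):
        self.episode_reward += reward

def environment_reward(completions, **kwargs):
    """Module-level reward callback: list[float] in environment order."""
    return [env.episode_reward for env in _instances]
```

A module-level callback is used because TRL expects a plain callable for reward ingestion, while the reward itself lives on per-episode instances.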
Episode state isolation across resets and concurrent instances
Each environment instance owns independent mutable episode state. Calling reset() clears reward and done flags for a fresh episode, preventing cross-episode leakage and avoiding cross-instance state sharing.
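A minimal sketch of the isolation property, with hypothetical attribute names: all mutable state is assigned per instance inside `reset()` (never as class-level mutables, which would be shared across instances).

```python
class EpisodeState:
    """Per-instance mutable episode state; no class-level mutables."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Fresh episode: clear accumulated reward and the done flag.
        self.episode_reward = 0.0
        self.done = False
```

Calling `reset()` between episodes prevents reward from one episode leaking into the next, and per-instance attributes prevent concurrent environments from observing each other's state.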
build_trainer accepts environment_factory
Before: build_trainer accepted a rollout-function path and passed custom rollout glue into trainer construction.
After: build_trainer accepts environment_factory and forwards the environment class directly to GRPOTrainer, with optional configure() pre-wiring from notebook config values.
The legacy rollout module remains in the repository for compatibility/reference but is no longer the training pipeline's default orchestration path.
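The After-shape of build_trainer could look like the sketch below. The trainer class is injected as a parameter here purely so the sketch runs without TRL installed; in the real pipeline it would be `trl.GRPOTrainer`, and `config` carries the notebook's values for the optional `configure()` pre-wiring.

```python
def build_trainer(environment_factory, trainer_cls, config=None, **trainer_kwargs):
    """Forward the environment class directly to the trainer.

    If `config` is given and the factory exposes configure(), pre-wire the
    class-level environment configuration before trainer construction.
    """
    if config is not None and hasattr(environment_factory, "configure"):
        environment_factory.configure(**config)
    return trainer_cls(environment_factory=environment_factory, **trainer_kwargs)
```

Note that the environment class is passed through uninstantiated: the trainer, not build_trainer, constructs instances via the no-argument factory contract.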