
# System Behavior: evaluation

Living document. Updated by /archive-spec when features are completed. Last archived: F011 on 2026-04-07


## Added

### Automated multi-episode evaluation

The system accepts an environment, a policy, and an episode count, then produces an EvaluationResult containing success_rate, avg_reward, avg_steps, and a per-episode breakdown. Evaluation runs all requested episodes and returns structured metrics in a single call.
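A minimal sketch of the aggregation described above. The `EvaluationResult` field names come from the spec; the `EpisodeRecord` layout and the `run_episode` callable are assumptions standing in for the real environment/policy loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class EpisodeRecord:
    # Hypothetical per-episode record; the real breakdown may carry more detail.
    reward: float
    steps: int
    success: bool

@dataclass
class EvaluationResult:
    success_rate: float
    avg_reward: float
    avg_steps: float
    episodes: List[EpisodeRecord] = field(default_factory=list)

def evaluate(run_episode: Callable[[], EpisodeRecord], num_episodes: int) -> EvaluationResult:
    """Run all requested episodes and return structured metrics in one call."""
    episodes = [run_episode() for _ in range(num_episodes)]
    n = len(episodes)
    return EvaluationResult(
        success_rate=sum(e.success for e in episodes) / n,
        avg_reward=sum(e.reward for e in episodes) / n,
        avg_steps=sum(e.steps for e in episodes) / n,
        episodes=episodes,
    )
```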

### Incremental result collection on failure

When an individual episode fails (environment error or policy error), the system records the failure in the per-episode breakdown and continues evaluating remaining episodes. Partial results are never lost.
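The failure-tolerant loop can be sketched as follows; the dict-based record shape is an assumption, not the spec's actual breakdown type.

```python
def evaluate_resilient(run_episode, num_episodes):
    """Record each episode's outcome; a failure is logged, not fatal."""
    records = []
    for i in range(num_episodes):
        try:
            records.append({"episode": i, "ok": True, "reward": run_episode(i)})
        except Exception as exc:  # environment error or policy error
            # Keep the partial result and move on to the next episode.
            records.append({"episode": i, "ok": False, "error": str(exc)})
    return records
```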

### Random baseline policy

The system provides a built-in random policy that accepts an SQLObservation and returns a random SQLAction. Given the same seed, the random policy produces identical action sequences across runs.
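The seed-determinism guarantee can be illustrated with a per-policy `random.Random` instance; the action set below is an assumed stand-in for real `SQLAction` values.

```python
import random

ACTIONS = ["DESCRIBE", "QUERY", "ANSWER"]  # assumed SQLAction kinds

class RandomPolicy:
    """Baseline policy: same seed -> identical action sequence across runs."""
    def __init__(self, seed: int):
        # A private RNG keeps the sequence reproducible regardless of
        # whatever else touches the global random state.
        self._rng = random.Random(seed)

    def act(self, observation):
        return self._rng.choice(ACTIONS)
```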

### Progress callback during evaluation

The evaluate function accepts an optional progress callback that receives (current_episode, total_episodes) after each episode completes, enabling progress reporting for long evaluation runs.
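The callback contract above is simple enough to sketch directly; the function and parameter names here are placeholders for the real evaluate signature.

```python
def evaluate_with_progress(run_episode, num_episodes, on_progress=None):
    """Invoke on_progress(current_episode, total_episodes) after each episode."""
    results = []
    for i in range(num_episodes):
        results.append(run_episode(i))
        if on_progress is not None:
            # current_episode is 1-based: "3 of 10 done", not "2 of 10".
            on_progress(i + 1, num_episodes)
    return results
```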

### Oracle policy baseline available for evaluation

The evaluation module accepts an OraclePolicy that, given the same question list as the environment, produces a deterministic optimal action sequence per episode (DESCRIBE the relevant tables, QUERY with the gold SQL, ANSWER with the gold answer). When run through evaluate(), the oracle achieves a near-perfect success rate and roughly 1.3 total reward, serving as an upper-bound baseline against random and trained policies.
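The DESCRIBE/QUERY/ANSWER replay can be sketched as a precomputed plan per question. The gold-lookup format (`tables`, `sql`, `answer` keys) and the `reset`/`act` interface are assumptions for illustration.

```python
class OraclePolicy:
    """Replays the gold action sequence for each question (sketch)."""
    def __init__(self, gold):
        # gold: question -> {"tables": [...], "sql": "...", "answer": "..."}
        self._gold = gold
        self._plan = []

    def reset(self, question):
        g = self._gold[question]
        # Deterministic plan: describe each relevant table, run the gold
        # SQL, then answer with the gold answer.
        self._plan = (
            [("DESCRIBE", t) for t in g["tables"]]
            + [("QUERY", g["sql"]), ("ANSWER", g["answer"])]
        )

    def act(self, observation):
        return self._plan.pop(0)
```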

### Oracle graceful fallback on unknown questions

When the oracle encounters a question not present in its lookup, it returns an ANSWER action with an empty string rather than raising an error. The episode is marked incorrect but the evaluation run continues without interruption.
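The fallback rule reduces to a single guarded lookup; the plan format below is an assumed sketch, not the real data structure.

```python
def oracle_lookup(gold, question):
    """Return the gold action plan, or an empty ANSWER for unknown questions.

    The episode will be scored incorrect, but no exception is raised, so
    the evaluation run continues uninterrupted.
    """
    if question not in gold:
        return [("ANSWER", "")]
    g = gold[question]
    return [("QUERY", g["sql"]), ("ANSWER", g["answer"])]
```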

### Compare-methods notebook produces prompting-vs-GRPO accuracy view

The system provides a notebook evaluation flow that evaluates every condition on a shared held-out subset and renders a side-by-side comparison table and bar chart for zero-shot, 1-shot, 3-shot, GRPO no-think, and GRPO thinking.
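A plain-text sketch of the side-by-side accuracy view; the real notebook renders a table and bar chart, and this stdlib rendering (plus the example numbers in the usage) is purely illustrative.

```python
def comparison_table(accuracies):
    """Render one aligned row per condition: name, then accuracy as a percent."""
    width = max(len(name) for name in accuracies)
    lines = [f"{name.ljust(width)}  {acc:6.1%}" for name, acc in accuracies.items()]
    return "\n".join(lines)
```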

### GRPO checkpoint evaluation degrades gracefully when repos are unavailable

When a configured GRPO checkpoint cannot be loaded from HF Hub, the notebook emits a warning and skips that condition while continuing remaining evaluations so users still receive partial comparison output.
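The warn-and-skip pattern can be sketched generically; `load_fn` below is a stand-in for the actual HF Hub loading call, and the mapping shape is an assumption.

```python
import warnings

def load_conditions(checkpoints, load_fn):
    """Load each condition's checkpoint; warn and skip any that fail.

    checkpoints: condition name -> repo id; load_fn: repo id -> model.
    Conditions that cannot be loaded are omitted, so the remaining
    evaluations still run and produce partial comparison output.
    """
    loaded = {}
    for name, repo_id in checkpoints.items():
        try:
            loaded[name] = load_fn(repo_id)
        except Exception as exc:
            warnings.warn(f"skipping {name}: could not load {repo_id} ({exc})")
    return loaded
```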

### LLM tool-calling policy converts model outputs into SQL actions

The notebook policy feeds tool schemas through the chat template, parses <tool_call> JSON blocks, and maps them to structured SQLAction objects; unparseable generations fall back to an ANSWER action so evaluation continues without crashing.
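The parse-then-fallback step can be sketched with a regex over the generation. The JSON schema inside the tool-call block (`{"name": ..., "arguments": {...}}`) and the `(kind, arguments)` action tuple are assumptions standing in for the real SQLAction mapping.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_action(generation: str):
    """Map a <tool_call> JSON block to an action tuple.

    Unparseable generations fall back to an ANSWER action carrying the
    raw text, so evaluation continues without crashing.
    """
    m = TOOL_CALL_RE.search(generation)
    if m:
        try:
            call = json.loads(m.group(1))
            return (call["name"].upper(), call.get("arguments", {}))
        except (json.JSONDecodeError, KeyError, AttributeError):
            pass  # malformed JSON or missing fields: fall through to ANSWER
    return ("ANSWER", {"text": generation.strip()})
```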