Verification Specification

Feature: F005
Generated from: specs/F005-VERIFICATION_INPUT.json
Generated: 2026-03-27


1. Unit Tests

1.1 EpisodeResult (frozen dataclass)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_episode_result_creation | Happy path construction | EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=5, error=None) | All fields accessible, values match | happy |
| test_episode_result_frozen | Cannot mutate after creation | Attempt result.correct = False | FrozenInstanceError raised | edge |
| test_episode_result_with_error | Episode that failed | EpisodeResult(episode_index=1, correct=False, total_reward=0.0, steps=0, error="connection error") | error field is "connection error" | error |
| test_episode_result_error_default_none | Error field defaults to None | EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=3) | error is None | happy |

Run: uv run pytest tests/unit/test_evaluation.py -v -k "EpisodeResult"
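The frozen-dataclass behaviour checked in this section can be sketched as follows. The field names come from the table above; the type annotations and the helper `is_frozen` are assumptions made for illustration, not the project's actual definitions.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EpisodeResult:
    """Stand-in for the real class: field names from the spec, types assumed."""

    episode_index: int
    correct: bool
    total_reward: float
    steps: int
    error: Optional[str] = None  # defaults to None when the episode succeeded


def is_frozen(result: EpisodeResult) -> bool:
    """Return True if assigning to a field raises FrozenInstanceError."""
    try:
        result.correct = False  # type: ignore[misc]
    except dataclasses.FrozenInstanceError:
        return True
    return False
```

In a pytest test the same check is usually written with `pytest.raises(dataclasses.FrozenInstanceError)` around the assignment.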

1.2 EvaluationResult (frozen dataclass)

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_evaluation_result_creation | Happy path with episodes | EvaluationResult(success_rate=0.5, avg_reward=0.75, avg_steps=3.0, n_episodes=2, n_completed=2, episodes=[...]) | All fields match | happy |
| test_evaluation_result_frozen | Cannot mutate after creation | Attempt result.success_rate = 1.0 | FrozenInstanceError raised | edge |
| test_evaluation_result_empty_episodes | Zero episodes edge case | EvaluationResult(success_rate=0.0, avg_reward=0.0, avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[]) | Valid construction, all zeros | edge |
| test_evaluation_result_partial_completion | Some episodes failed | n_episodes=10, n_completed=7 | n_completed < n_episodes allowed | edge |
| test_evaluation_result_success_rate_bounds | Success rate between 0 and 1 | success_rate=0.0 and success_rate=1.0 | Both valid | edge |

Run: uv run pytest tests/unit/test_evaluation.py -v -k "EvaluationResult"

1.3 Policy Protocol

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_policy_protocol_compliance | Object with select_action satisfies Policy | Custom class with select_action(obs) -> SQLAction | isinstance(obj, Policy) or structural match | happy |
| test_policy_protocol_missing_method | Object without select_action | Plain object | Does NOT satisfy Protocol | error |

Run: uv run pytest tests/unit/test_evaluation.py -v -k "Policy"
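A minimal sketch of how the structural check could work, assuming `Policy` is a `typing.Protocol` decorated with `runtime_checkable` (the spec does not say whether it is). `SQLAction`, `AlwaysAnswer`, and `NotAPolicy` below are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class SQLAction:
    """Hypothetical stand-in for the project's SQLAction."""

    action_type: str
    argument: str = ""


@runtime_checkable
class Policy(Protocol):
    """Anything with a select_action method counts as a Policy."""

    def select_action(self, observation: Any) -> SQLAction: ...


class AlwaysAnswer:
    """Satisfies Policy structurally, without inheriting from it."""

    def select_action(self, observation: Any) -> SQLAction:
        return SQLAction(action_type="ANSWER")


class NotAPolicy:
    """Has no select_action method, so it does not satisfy Policy."""
```

Note that `isinstance` against a runtime-checkable protocol only verifies that the method exists; it does not check the signature or return type, which is why the table allows "structural match" as an alternative.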

1.4 RandomPolicy.__init__

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_random_policy_default_seed | No seed provided | RandomPolicy() | Constructs successfully | happy |
| test_random_policy_with_seed | Explicit seed | RandomPolicy(seed=42) | Constructs successfully | happy |
| test_random_policy_none_seed | Explicit None seed | RandomPolicy(seed=None) | Constructs successfully | happy |

Run: uv run pytest tests/unit/test_evaluation.py -v -k "random_policy_init or random_policy_default or random_policy_with_seed or random_policy_none"

1.5 RandomPolicy.select_action

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_random_policy_explores_when_budget_gt_1 | Budget > 1 means exploration | Observation with budget_remaining=10 | Returns SQLAction with action_type in {DESCRIBE, SAMPLE, QUERY} | happy |
| test_random_policy_answers_when_budget_eq_1 | Budget == 1 forces ANSWER | Observation with budget_remaining=1 | Returns SQLAction with action_type == "ANSWER" | happy |
| test_random_policy_returns_sql_action | Return type is correct | Any valid observation | isinstance(result, SQLAction) | happy |
| test_random_policy_deterministic_with_seed | Same seed produces same actions | Two RandomPolicy(seed=42) with identical observations | Same sequence of actions | happy |
| test_random_policy_varies_without_seed | Different runs produce different actions (probabilistic) | Multiple calls without seed | Not all actions identical (run 50 times) | edge |
| test_random_policy_explores_all_action_types | Over many calls, all exploration types appear | Run 100 times with budget > 1 | DESCRIBE, SAMPLE, and QUERY each appear at least once | edge |

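Run: uv run pytest tests/unit/test_evaluation.py -v -k "random_policy_explores or random_policy_answers or random_policy_returns or random_policy_deterministic or random_policy_varies"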
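The budget rule these tests describe can be sketched as below. This is an illustrative stand-in, not the project's implementation: `Observation` and `SQLAction` are reduced to the single field each test touches, and treating any budget of 1 or less as "must answer" is an assumption (the spec only covers budget == 1 and budget > 1).

```python
import random
from dataclasses import dataclass
from typing import Optional

EXPLORATION_TYPES = ("DESCRIBE", "SAMPLE", "QUERY")


@dataclass
class Observation:
    """Hypothetical observation carrying only the remaining step budget."""

    budget_remaining: int


@dataclass
class SQLAction:
    """Hypothetical action carrying only its type."""

    action_type: str


class RandomPolicy:
    def __init__(self, seed: Optional[int] = None) -> None:
        # A private Random instance keeps the policy reproducible for a given
        # seed without touching the global random state.
        self._rng = random.Random(seed)

    def select_action(self, observation: Observation) -> SQLAction:
        # Assumption: budget <= 1 is treated the same as budget == 1.
        if observation.budget_remaining <= 1:
            return SQLAction("ANSWER")
        return SQLAction(self._rng.choice(EXPLORATION_TYPES))
```

Two instances built with the same seed and fed the same observations produce the same action sequence, which is the property test_random_policy_deterministic_with_seed relies on.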

1.6 evaluate()

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_evaluate_happy_path | Run N episodes successfully | evaluate(env, policy, n_episodes=5) | Returns EvaluationResult with n_episodes=5, n_completed=5 | happy |
| test_evaluate_returns_evaluation_result | Return type correct | Any valid call | isinstance(result, EvaluationResult) | happy |
| test_evaluate_default_n_episodes | Default is 100 | evaluate(env, policy) | result.n_episodes == 100 | happy |
| test_evaluate_n_episodes_zero | Zero episodes | evaluate(env, policy, n_episodes=0) | EvaluationResult with all zeros, empty episodes list | edge |
| test_evaluate_negative_n_episodes | Negative episodes | evaluate(env, policy, n_episodes=-1) | Raises ValueError | error |
| test_evaluate_success_rate_calculation | Correct fraction | Policy that answers correctly 3 out of 5 times | success_rate == 0.6 | happy |
| test_evaluate_avg_reward_calculation | Mean reward correct | Known rewards per episode | avg_reward matches manual calculation | happy |
| test_evaluate_avg_steps_calculation | Mean steps correct | Known steps per episode | avg_steps matches manual calculation | happy |
| test_evaluate_episodes_list_length | Per-episode breakdown | n_episodes=5 | len(result.episodes) == 5 | happy |
| test_evaluate_episode_indices | 0-based episode indices | n_episodes=3 | [e.episode_index for e in result.episodes] == [0, 1, 2] | happy |
| test_evaluate_seed_determinism | Same seed produces same results | Two calls with seed=42, n_episodes=10 | Both EvaluationResults have identical success_rate, avg_reward, avg_steps | happy |
| test_evaluate_seed_per_episode | Episode i uses seed+i | seed=100, n_episodes=3 | env.reset called with seeds 100, 101, 102 (verify via mock) | happy |
| test_evaluate_no_seed_variation | No seed allows variation | Two calls without seed | Results may differ (non-deterministic) | edge |
| test_evaluate_n_episodes_one | Single episode | n_episodes=1 | Valid result with 1 episode | edge |
| test_evaluate_large_n_episodes | Large run | n_episodes=500 | Completes without error, correct counts | edge |

Run: uv run pytest tests/unit/test_evaluation.py -v -k "test_evaluate"
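For the "manual calculation" rows, one way to compute the reference values is sketched below. The helper `aggregate` is hypothetical, and it bakes in two assumptions: success_rate divides by the total number of episodes (as in the integration check `sum(e.correct ...) / 10`), and the two averages use only episodes whose error is None (as stated in the edge-case checklist).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EpisodeResult:
    """Stand-in with the field names used throughout this spec."""

    episode_index: int
    correct: bool
    total_reward: float
    steps: int
    error: Optional[str] = None


def aggregate(episodes: List[EpisodeResult]) -> Tuple[float, float, float, int]:
    """Return (success_rate, avg_reward, avg_steps, n_completed)."""
    n_episodes = len(episodes)
    completed = [e for e in episodes if e.error is None]
    n_completed = len(completed)
    # Guard every division so zero episodes or zero completions yield 0.0.
    success_rate = sum(e.correct for e in episodes) / n_episodes if n_episodes else 0.0
    avg_reward = sum(e.total_reward for e in completed) / n_completed if n_completed else 0.0
    avg_steps = sum(e.steps for e in completed) / n_completed if n_completed else 0.0
    return success_rate, avg_reward, avg_steps, n_completed
```

When comparing these floats against `evaluate()` output in a real test, an approximate comparison such as `pytest.approx` is safer than `==`.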

1.7 evaluate() -- Error Handling

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_evaluate_episode_exception_recorded | Exception during episode is caught | Policy that raises on episode 2 | Episode 2 has correct=False, total_reward=0.0, steps=0, error=<message> | error |
| test_evaluate_continues_after_exception | Failed episode does not stop evaluation | Exception on episode 1 of 5 | n_episodes=5, all 5 episodes in result | error |
| test_evaluate_n_completed_excludes_errors | n_completed counts only successes | 2 out of 5 episodes raise | n_completed == 3 | error |
| test_evaluate_averages_exclude_failed | avg_reward/avg_steps from completed episodes only | 3 completed with known values, 2 failed | Averages match only the 3 completed | error |
| test_evaluate_env_reset_exception | Exception during env.reset() | Mock env.reset() to raise on episode 3 | Episode 3 recorded with error, others complete | error |
| test_evaluate_policy_exception | Exception during select_action() | Mock policy.select_action() to raise | Episode recorded with error, evaluation continues | error |
| test_evaluate_env_step_exception | Exception during env.step() | Mock env.step() to raise | Episode recorded with error, evaluation continues | error |
| test_evaluate_all_episodes_fail | Every episode fails | Policy that always raises | n_completed=0, success_rate=0.0, avg_reward=0.0, avg_steps=0.0 | error |

Run: uv run pytest tests/unit/test_evaluation.py -v -k "exception or error or fail"
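The error-isolation behaviour above, combined with per-episode seeding (episode i resets with seed+i), might look like the following loop. Everything here is a sketch under stated assumptions: `run_episodes`, `FakeEnv`, `FakeObservation`, and `AnswerPolicy` are hypothetical, the observation is assumed to expose `done` and `reward`, and a positive final reward is assumed to mean a correct answer.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, List, Optional


@dataclass(frozen=True)
class EpisodeResult:
    episode_index: int
    correct: bool
    total_reward: float
    steps: int
    error: Optional[str] = None


@dataclass
class FakeObservation:
    """Hypothetical observation: done flag plus the last step's reward."""

    done: bool = False
    reward: float = 0.0


class FakeEnv:
    """Hypothetical env: episodes end after two steps; chosen seeds fail on reset."""

    def __init__(self, fail_on_seeds: FrozenSet[int] = frozenset()) -> None:
        self.fail_on_seeds = fail_on_seeds
        self.reset_seeds: List[Optional[int]] = []
        self._steps = 0

    def reset(self, seed: Optional[int] = None) -> FakeObservation:
        self.reset_seeds.append(seed)  # recorded so a test can inspect seeding
        if seed in self.fail_on_seeds:
            raise RuntimeError(f"reset failed for seed {seed}")
        self._steps = 0
        return FakeObservation()

    def step(self, action: Any) -> FakeObservation:
        self._steps += 1
        done = self._steps >= 2
        return FakeObservation(done=done, reward=1.0 if done else 0.0)


class AnswerPolicy:
    def select_action(self, observation: Any) -> str:
        return "ANSWER"


def run_episodes(env: Any, policy: Any, n_episodes: int,
                 seed: Optional[int] = None) -> List[EpisodeResult]:
    if n_episodes < 0:
        raise ValueError("n_episodes must be non-negative")
    results: List[EpisodeResult] = []
    for i in range(n_episodes):
        try:
            obs = env.reset(seed=None if seed is None else seed + i)
            total_reward, steps = 0.0, 0
            while not obs.done:
                obs = env.step(policy.select_action(obs))
                total_reward += obs.reward
                steps += 1
            # Assumption: a positive final reward marks a correct answer.
            results.append(EpisodeResult(i, obs.reward > 0, total_reward, steps))
        except Exception as exc:
            # A failed episode is recorded with zeroed metrics; the loop goes on.
            results.append(EpisodeResult(i, False, 0.0, 0, error=str(exc)))
    return results
```

Wrapping the whole episode body in one try block means failures in reset, select_action, and step are all handled the same way, which matches the three separate exception tests in the table.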

1.8 evaluate() -- Progress Callback

| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_evaluate_progress_callback_called | Callback receives updates | Mock callback, n_episodes=5 | Callback called with (1,5), (2,5), (3,5), (4,5), (5,5) | happy |
| test_evaluate_no_callback | None callback is fine | progress_callback=None | No error | happy |
| test_evaluate_callback_receives_correct_total | Total matches n_episodes | n_episodes=10 | Every callback call has total=10 | happy |

Run: uv run pytest tests/unit/test_evaluation.py -v -k "callback"
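The callback sequence can be asserted with `unittest.mock.Mock` and its `call_args_list`. In the sketch below, `run_with_progress` is a hypothetical stand-in for the episode loop inside `evaluate()`; only the `(current, total)` calling convention is taken from this spec.

```python
from typing import Callable, List, Optional
from unittest.mock import Mock, call


def run_with_progress(
    n_episodes: int,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> None:
    """Hypothetical stand-in for evaluate(): reports progress after each episode."""
    for i in range(n_episodes):
        # ... one episode would run here ...
        if progress_callback is not None:
            progress_callback(i + 1, n_episodes)  # 1-based current, fixed total


def expected_progress_calls(n_episodes: int) -> List:
    """Build the call list (1, n), (2, n), ..., (n, n) to compare against."""
    return [call(i, n_episodes) for i in range(1, n_episodes + 1)]
```

A test would then pass a `Mock()` as the callback and compare `mock.call_args_list` with `expected_progress_calls(5)`, which checks both the order and the arguments of every call.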


2. Integration Tests

Flow: Full Evaluation with RandomPolicy

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Create SQLEnvironment with test DB and questions | Environment loads successfully | len(env.questions) > 0 |
| 2 | Create RandomPolicy(seed=42) | Policy created | Object has select_action method |
| 3 | Call evaluate(env, RandomPolicy(seed=42), n_episodes=10, seed=0) | Returns EvaluationResult | result.n_episodes == 10 |
| 4 | Verify all episodes recorded | Per-episode breakdown present | len(result.episodes) == 10 |
| 5 | Verify aggregate metrics are consistent | success_rate matches manual count | result.success_rate == sum(e.correct for e in result.episodes) / 10 |
| 6 | Verify avg_reward consistent | avg_reward matches manual mean | result.avg_reward == mean([e.total_reward for e in result.episodes if e.error is None]) |
| 7 | Verify determinism | Repeat with same seed | Identical results |

Run: uv run pytest tests/integration/test_evaluation_integration.py -v -k "full_evaluation"

Flow: Evaluation with Partial Failures

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Create environment and a policy that fails on specific episodes | Setup complete | -- |
| 2 | Call evaluate(env, flaky_policy, n_episodes=5) | Returns result with mix of successes and failures | result.n_completed < result.n_episodes |
| 3 | Inspect failed episodes | Have error field set | any(e.error is not None for e in result.episodes) |
| 4 | Inspect successful episodes | Have error=None | Completed episodes have error is None and valid metrics |

Run: uv run pytest tests/integration/test_evaluation_integration.py -v -k "partial_failure"

Flow: Zero Episodes

| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Call evaluate(env, policy, n_episodes=0) | Returns zero-state result | All aggregate values are 0.0, episodes list is empty |

Run: uv run pytest tests/integration/test_evaluation_integration.py -v -k "zero_episodes"


3. API Tests

No API endpoints defined for F005. This section is intentionally empty.


4. E2E Tests

Scenario: Single-Command Evaluation of Random Baseline

Setup: SQLEnvironment initialized with Spider-format test database and questions file.

Actions: Call evaluate(env, RandomPolicy(seed=42), n_episodes=20, seed=0) and inspect output.

Expected:

  • Returns EvaluationResult with n_episodes=20
  • success_rate is a float in [0.0, 1.0]
  • avg_reward is a float
  • avg_steps is a positive float
  • n_completed == 20 (no errors with valid env + RandomPolicy)
  • All 20 EpisodeResult entries present with valid fields
  • Deterministic: re-running with same seeds yields identical results

Run: uv run pytest tests/e2e/test_evaluation_e2e.py -v

Scenario: Comparison of Two Policies

Setup: SQLEnvironment with test data.

Actions:

  1. Evaluate RandomPolicy(seed=1) over 20 episodes
  2. Evaluate an "always answer immediately" policy over 20 episodes
  3. Compare results

Expected:

  • Both return valid EvaluationResult
  • Results are structurally comparable (same fields)
  • Metrics differ between policies

Run: uv run pytest tests/e2e/test_evaluation_e2e.py -v -k "comparison"


5. Edge Cases Checklist

  • n_episodes = 0 returns zero-valued EvaluationResult with empty episodes list
  • n_episodes = -1 raises ValueError immediately
  • n_episodes = 1 works correctly (single episode)
  • All episodes fail -- n_completed=0, averages are 0.0, success_rate is 0.0
  • Exception during env.reset() is caught and recorded
  • Exception during policy.select_action() is caught and recorded
  • Exception during env.step() is caught and recorded
  • RandomPolicy with budget_remaining=1 always returns ANSWER
  • RandomPolicy with budget_remaining > 1 never returns ANSWER
  • Seed determinism: same seed + same n_episodes = identical EvaluationResult
  • Per-episode seeding: episode i uses seed+i for env.reset()
  • Progress callback receives (current, total) for each episode
  • Progress callback=None does not cause errors
  • EpisodeResult and EvaluationResult are frozen (immutable)
  • Large n_episodes (500+) completes without memory issues
  • success_rate is always in [0.0, 1.0]
  • avg_reward and avg_steps computed only from completed (non-error) episodes

6. Evidence Requirements

| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | X passed from uv run pytest tests/unit/test_evaluation.py -v |
| Integration | pytest output | X passed from uv run pytest tests/integration/test_evaluation_integration.py -v |
| E2E | pytest output | X passed from uv run pytest tests/e2e/test_evaluation_e2e.py -v |
| Edge cases | pytest output | All edge-case tests in checklist pass |
| Determinism | pytest output | Seed-based tests produce identical results across runs |