Verification Specification
Feature: F005
Generated from: specs/F005-VERIFICATION_INPUT.json
Generated: 2026-03-27
1. Unit Tests
1.1 EpisodeResult (frozen dataclass)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_episode_result_creation | Happy path construction | EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=5, error=None) | All fields accessible, values match | happy |
| test_episode_result_frozen | Cannot mutate after creation | Attempt result.correct = False | FrozenInstanceError raised | edge |
| test_episode_result_with_error | Episode that failed | EpisodeResult(episode_index=1, correct=False, total_reward=0.0, steps=0, error="connection error") | error field is "connection error" | error |
| test_episode_result_error_default_none | Error field defaults to None | EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=3) | error is None | happy |
Run: uv run pytest tests/unit/test_evaluation.py -v -k "EpisodeResult"
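The frozen-dataclass behaviour these tests target can be illustrated with a minimal sketch. The class below is a hypothetical stand-in built only from the field names in the table; the project's real EpisodeResult may differ. The helper mutation_is_rejected is likewise invented for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


# Hypothetical stand-in mirroring the fields listed in the table above;
# the real class lives in the project and may differ.
@dataclass(frozen=True)
class EpisodeResult:
    episode_index: int
    correct: bool
    total_reward: float
    steps: int
    error: Optional[str] = None  # defaults to None when the episode succeeded


def mutation_is_rejected(result: EpisodeResult) -> bool:
    """Return True if assigning to a field raises FrozenInstanceError."""
    try:
        result.correct = False  # type: ignore[misc]
    except FrozenInstanceError:
        return True
    return False


ok = EpisodeResult(episode_index=0, correct=True, total_reward=1.0, steps=3)
failed = EpisodeResult(
    episode_index=1, correct=False, total_reward=0.0, steps=0,
    error="connection error",
)
```

In a pytest test, the try/except would normally be written as `with pytest.raises(FrozenInstanceError):` around the assignment.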
1.2 EvaluationResult (frozen dataclass)
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_evaluation_result_creation | Happy path with episodes | EvaluationResult(success_rate=0.5, avg_reward=0.75, avg_steps=3.0, n_episodes=2, n_completed=2, episodes=[...]) | All fields match | happy |
| test_evaluation_result_frozen | Cannot mutate after creation | Attempt result.success_rate = 1.0 | FrozenInstanceError raised | edge |
| test_evaluation_result_empty_episodes | Zero episodes edge case | EvaluationResult(success_rate=0.0, avg_reward=0.0, avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[]) | Valid construction, all zeros | edge |
| test_evaluation_result_partial_completion | Some episodes failed | n_episodes=10, n_completed=7 | n_completed < n_episodes allowed | edge |
| test_evaluation_result_success_rate_bounds | Success rate between 0 and 1 | success_rate=0.0 and success_rate=1.0 | Both valid | edge |
Run: uv run pytest tests/unit/test_evaluation.py -v -k "EvaluationResult"
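When writing the frozen test for EvaluationResult, it may help to keep in mind what `frozen=True` does and does not guarantee. The sketch below uses a hypothetical stand-in class with the field names from the table. It shows that rebinding a field is rejected, while a list stored in a field can still be mutated in place, so "immutable" here means "fields cannot be reassigned".

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import List


# Hypothetical stand-in; field names are taken from the table above.
@dataclass(frozen=True)
class EvaluationResult:
    success_rate: float
    avg_reward: float
    avg_steps: float
    n_episodes: int
    n_completed: int
    episodes: List[object]


empty = EvaluationResult(0.0, 0.0, 0.0, n_episodes=0, n_completed=0, episodes=[])
partial = EvaluationResult(0.4, 0.5, 3.0, n_episodes=10, n_completed=7, episodes=[])

# Rebinding a field on a frozen dataclass raises FrozenInstanceError.
try:
    partial.success_rate = 1.0  # type: ignore[misc]
    rebinding_blocked = False
except FrozenInstanceError:
    rebinding_blocked = True

# The list object itself is not frozen: in-place mutation still succeeds.
demo = EvaluationResult(0.0, 0.0, 0.0, n_episodes=0, n_completed=0, episodes=[])
demo.episodes.append("placeholder")
```

If deeper immutability is wanted, storing the episodes as a tuple is one option, though the spec as written passes a list.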
1.3 Policy Protocol
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_policy_protocol_compliance | Object with select_action satisfies Policy | Custom class with select_action(obs) -> SQLAction | isinstance(obj, Policy) or structural match | happy |
| test_policy_protocol_missing_method | Object without select_action | Plain object | Does NOT satisfy Protocol | error |
Run: uv run pytest tests/unit/test_evaluation.py -v -k "Policy"
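For the isinstance variant of the compliance test to work, the protocol has to be decorated with `typing.runtime_checkable`. The sketch below assumes that is how Policy is declared, which the spec does not confirm. AlwaysAnswer and NotAPolicy are hypothetical classes invented for the example.

```python
from typing import Any, Protocol, runtime_checkable


# Assumption: the project's Policy is a runtime-checkable Protocol.
@runtime_checkable
class Policy(Protocol):
    def select_action(self, observation: Any) -> Any:
        ...


class AlwaysAnswer:
    """Satisfies Policy structurally; it does not inherit from it."""

    def select_action(self, observation: Any) -> str:
        return "ANSWER"


class NotAPolicy:
    """Has no select_action method, so it should not match."""


matches = isinstance(AlwaysAnswer(), Policy)
rejects_custom = not isinstance(NotAPolicy(), Policy)
rejects_plain = not isinstance(object(), Policy)
```

A runtime isinstance check against a protocol only verifies that the method exists. It does not check the signature or return type, which remain the job of a static type checker.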
1.4 RandomPolicy.__init__
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_random_policy_default_seed | No seed provided | RandomPolicy() | Constructs successfully | happy |
| test_random_policy_with_seed | Explicit seed | RandomPolicy(seed=42) | Constructs successfully | happy |
| test_random_policy_none_seed | Explicit None seed | RandomPolicy(seed=None) | Constructs successfully | happy |
Run: uv run pytest tests/unit/test_evaluation.py -v -k "random_policy_init or random_policy_default or random_policy_with_seed or random_policy_none"
1.5 RandomPolicy.select_action
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_random_policy_explores_when_budget_gt_1 | Budget > 1 means exploration | Observation with budget_remaining=10 | Returns SQLAction with action_type in {DESCRIBE, SAMPLE, QUERY} | happy |
| test_random_policy_answers_when_budget_eq_1 | Budget == 1 forces ANSWER | Observation with budget_remaining=1 | Returns SQLAction with action_type == "ANSWER" | happy |
| test_random_policy_returns_sql_action | Return type is correct | Any valid observation | isinstance(result, SQLAction) | happy |
| test_random_policy_deterministic_with_seed | Same seed produces same actions | Two RandomPolicy(seed=42) with identical observations | Same sequence of actions | happy |
| test_random_policy_varies_without_seed | Different runs produce different actions (probabilistic) | Multiple calls without seed | Not all actions identical (run 50 times) | edge |
| test_random_policy_explores_all_action_types | Over many calls, all exploration types appear | Run 100 times with budget > 1 | DESCRIBE, SAMPLE, and QUERY each appear at least once | edge |
Run: uv run pytest tests/unit/test_evaluation.py -v -k "random_policy_select"
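The budget rule and seed determinism described in the table can be sketched as follows. SQLAction and Observation are hypothetical stand-ins with only the attributes the table mentions, and the use of a private `random.Random` instance is an assumption about how seeding might be implemented, not the project's confirmed design.

```python
import random
from dataclasses import dataclass
from typing import Optional

EXPLORATION_TYPES = ("DESCRIBE", "SAMPLE", "QUERY")


@dataclass(frozen=True)
class SQLAction:  # hypothetical stand-in for the project's action type
    action_type: str
    argument: str = ""


@dataclass(frozen=True)
class Observation:  # hypothetical stand-in; only the budget matters here
    budget_remaining: int


class RandomPolicy:
    def __init__(self, seed: Optional[int] = None) -> None:
        # A private RNG keeps the policy independent of the global random state.
        self._rng = random.Random(seed)

    def select_action(self, observation: Observation) -> SQLAction:
        # Assumption: a budget of 1 or less forces the final ANSWER action.
        if observation.budget_remaining <= 1:
            return SQLAction(action_type="ANSWER")
        return SQLAction(action_type=self._rng.choice(EXPLORATION_TYPES))


def action_sequence(policy: RandomPolicy, n: int) -> list:
    """Collect n exploration choices from a policy with budget to spare."""
    obs = Observation(budget_remaining=10)
    return [policy.select_action(obs).action_type for _ in range(n)]
```

Two policies built with the same seed should then yield identical sequences from action_sequence, which is the property test_random_policy_deterministic_with_seed checks.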
1.6 evaluate()
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_evaluate_happy_path | Run N episodes successfully | evaluate(env, policy, n_episodes=5) | Returns EvaluationResult with n_episodes=5, n_completed=5 | happy |
| test_evaluate_returns_evaluation_result | Return type correct | Any valid call | isinstance(result, EvaluationResult) | happy |
| test_evaluate_default_n_episodes | Default is 100 | evaluate(env, policy) | result.n_episodes == 100 | happy |
| test_evaluate_n_episodes_zero | Zero episodes | evaluate(env, policy, n_episodes=0) | EvaluationResult with all zeros, empty episodes list | edge |
| test_evaluate_negative_n_episodes | Negative episodes | evaluate(env, policy, n_episodes=-1) | Raises ValueError | error |
| test_evaluate_success_rate_calculation | Correct fraction | Policy that answers correctly 3 out of 5 times | success_rate == 0.6 | happy |
| test_evaluate_avg_reward_calculation | Mean reward correct | Known rewards per episode | avg_reward matches manual calculation | happy |
| test_evaluate_avg_steps_calculation | Mean steps correct | Known steps per episode | avg_steps matches manual calculation | happy |
| test_evaluate_episodes_list_length | Per-episode breakdown | n_episodes=5 | len(result.episodes) == 5 | happy |
| test_evaluate_episode_indices | 0-based episode indices | n_episodes=3 | [e.episode_index for e in result.episodes] == [0, 1, 2] | happy |
| test_evaluate_seed_determinism | Same seed produces same results | Two calls with seed=42, n_episodes=10 | Both EvaluationResults have identical success_rate, avg_reward, avg_steps | happy |
| test_evaluate_seed_per_episode | Episode i uses seed+i | seed=100, n_episodes=3 | env.reset called with seeds 100, 101, 102 (verify via mock) | happy |
| test_evaluate_no_seed_variation | No seed allows variation | Two calls without seed | Results may differ (non-deterministic) | edge |
| test_evaluate_n_episodes_one | Single episode | n_episodes=1 | Valid result with 1 episode | edge |
| test_evaluate_large_n_episodes | Large run | n_episodes=500 | Completes without error, correct counts | edge |
Run: uv run pytest tests/unit/test_evaluation.py -v -k "test_evaluate"
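To make the expectations in this table concrete, here is one possible shape for the evaluation loop. It is a sketch under stated assumptions, not the project's implementation. The environment interface is hypothetical: `reset(seed=...)` returns an observation and `step(action)` returns `(obs, reward, done, correct)`. Computing success_rate over all n_episodes rather than over completed episodes is also an assumption, and StubEnv and StubPolicy exist only to exercise the loop.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class EpisodeResult:
    episode_index: int
    correct: bool
    total_reward: float
    steps: int
    error: Optional[str] = None


@dataclass(frozen=True)
class EvaluationResult:
    success_rate: float
    avg_reward: float
    avg_steps: float
    n_episodes: int
    n_completed: int
    episodes: List[EpisodeResult]


def evaluate(env, policy, n_episodes: int = 100, seed: Optional[int] = None,
             progress_callback: Optional[Callable[[int, int], None]] = None
             ) -> EvaluationResult:
    if n_episodes < 0:
        raise ValueError("n_episodes must be non-negative")
    episodes: List[EpisodeResult] = []
    for i in range(n_episodes):
        # Episode i is seeded with seed + i; no seed means no per-episode seed.
        episode_seed = None if seed is None else seed + i
        try:
            obs = env.reset(seed=episode_seed)  # hypothetical env interface
            total, steps, done, correct = 0.0, 0, False, False
            while not done:
                obs, reward, done, correct = env.step(policy.select_action(obs))
                total += reward
                steps += 1
            episodes.append(EpisodeResult(i, bool(correct), total, steps))
        except Exception as exc:
            # A failed episode is recorded with zeroed metrics and the message.
            episodes.append(EpisodeResult(i, False, 0.0, 0, error=str(exc)))
        if progress_callback is not None:
            progress_callback(i + 1, n_episodes)
    completed = [e for e in episodes if e.error is None]
    n_done = len(completed)
    return EvaluationResult(
        # Assumption: success_rate is taken over all requested episodes.
        success_rate=sum(e.correct for e in episodes) / n_episodes if n_episodes else 0.0,
        avg_reward=sum(e.total_reward for e in completed) / n_done if n_done else 0.0,
        avg_steps=sum(e.steps for e in completed) / n_done if n_done else 0.0,
        n_episodes=n_episodes,
        n_completed=n_done,
        episodes=episodes,
    )


class StubEnv:
    """Hypothetical one-step environment that records the seeds it receives."""

    def __init__(self) -> None:
        self.seeds: List[Optional[int]] = []

    def reset(self, seed: Optional[int] = None):
        self.seeds.append(seed)
        return {"budget_remaining": 1}

    def step(self, action):
        return None, 1.0, True, True


class StubPolicy:
    def select_action(self, obs):
        return "ANSWER"
```

Recording seeds in the stub is a lightweight alternative to a mock for test_evaluate_seed_per_episode: after `evaluate(env, StubPolicy(), n_episodes=3, seed=100)`, `env.seeds` can be compared against the expected per-episode seeds.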
1.7 evaluate() -- Error Handling
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_evaluate_episode_exception_recorded | Exception during episode is caught | Policy that raises on episode 2 | Episode 2 has correct=False, total_reward=0.0, steps=0, error=<message> | error |
| test_evaluate_continues_after_exception | Failed episode does not stop evaluation | Exception on episode 1 of 5 | n_episodes=5, all 5 episodes in result | error |
| test_evaluate_n_completed_excludes_errors | n_completed counts only successes | 2 out of 5 episodes raise | n_completed == 3 | error |
| test_evaluate_averages_exclude_failed | avg_reward/avg_steps from completed episodes only | 3 completed with known values, 2 failed | Averages match only the 3 completed | error |
| test_evaluate_env_reset_exception | Exception during env.reset() | Mock env.reset() to raise on episode 3 | Episode 3 recorded with error, others complete | error |
| test_evaluate_policy_exception | Exception during select_action() | Mock policy.select_action() to raise | Episode recorded with error, evaluation continues | error |
| test_evaluate_env_step_exception | Exception during env.step() | Mock env.step() to raise | Episode recorded with error, evaluation continues | error |
| test_evaluate_all_episodes_fail | Every episode fails | Policy that always raises | n_completed=0, success_rate=0.0, avg_reward=0.0, avg_steps=0.0 | error |
Run: uv run pytest tests/unit/test_evaluation.py -v -k "exception or error or fail"
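The "averages exclude failed episodes" rule is easiest to check against a worked example. The helper summarize_completed below is hypothetical and exists only to show the arithmetic for three completed and two failed episodes; the reward and step values are made up for illustration.

```python
from statistics import mean
from types import SimpleNamespace


def summarize_completed(episodes):
    """Average reward and steps over episodes whose error field is None."""
    completed = [e for e in episodes if e.error is None]
    if not completed:
        # Matches the all-episodes-fail case: averages fall back to 0.0.
        return {"n_completed": 0, "avg_reward": 0.0, "avg_steps": 0.0}
    return {
        "n_completed": len(completed),
        "avg_reward": mean(e.total_reward for e in completed),
        "avg_steps": mean(e.steps for e in completed),
    }


# Illustrative data: three completed episodes, two that raised.
episodes = [
    SimpleNamespace(total_reward=1.0, steps=4, error=None),
    SimpleNamespace(total_reward=0.5, steps=6, error=None),
    SimpleNamespace(total_reward=0.0, steps=0, error="boom"),
    SimpleNamespace(total_reward=0.0, steps=5, error=None),
    SimpleNamespace(total_reward=0.0, steps=0, error="timeout"),
]

summary = summarize_completed(episodes)
# Completed only: avg_reward = (1.0 + 0.5 + 0.0) / 3 = 0.5, avg_steps = 15 / 3 = 5.
# Dividing by all five episodes instead would give 0.3 and 3.0, which is the
# mistake test_evaluate_averages_exclude_failed is meant to catch.
```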
1.8 evaluate() -- Progress Callback
| Test | Description | Input | Expected | Category |
|---|---|---|---|---|
| test_evaluate_progress_callback_called | Callback receives updates | Mock callback, n_episodes=5 | Callback called with (1,5), (2,5), (3,5), (4,5), (5,5) | happy |
| test_evaluate_no_callback | None callback is fine | progress_callback=None | No error | happy |
| test_evaluate_callback_receives_correct_total | Total matches n_episodes | n_episodes=10 | Every callback call has total=10 | happy |
Run: uv run pytest tests/unit/test_evaluation.py -v -k "callback"
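The callback expectations map directly onto `unittest.mock.Mock` and its recorded `call_args_list`. In the sketch below, run_with_progress is a hypothetical stand-in for the progress-reporting part of evaluate(), so the example stays self-contained; a real test would pass the mock to evaluate() instead.

```python
from unittest.mock import Mock, call


def run_with_progress(n_episodes, progress_callback=None):
    """Hypothetical stand-in: reports (current, total) after each episode."""
    for i in range(n_episodes):
        # ... an episode would run here ...
        if progress_callback is not None:
            progress_callback(i + 1, n_episodes)


callback = Mock()
run_with_progress(5, progress_callback=callback)

# The mock records every invocation in order, so the full sequence can be
# compared against the expected 1-based (current, total) pairs.
expected_calls = [call(1, 5), call(2, 5), call(3, 5), call(4, 5), call(5, 5)]
totals = [c.args[1] for c in callback.call_args_list]

# Passing no callback must simply skip reporting.
run_with_progress(3, progress_callback=None)
```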
2. Integration Tests
Flow: Full Evaluation with RandomPolicy
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Create SQLEnvironment with test DB and questions | Environment loads successfully | len(env.questions) > 0 |
| 2 | Create RandomPolicy(seed=42) | Policy created | Object has select_action method |
| 3 | Call evaluate(env, RandomPolicy(seed=42), n_episodes=10, seed=0) | Returns EvaluationResult | result.n_episodes == 10 |
| 4 | Verify all episodes recorded | Per-episode breakdown present | len(result.episodes) == 10 |
| 5 | Verify aggregate metrics are consistent | success_rate matches manual count | result.success_rate == sum(e.correct for e in result.episodes) / 10 |
| 6 | Verify avg_reward consistent | avg_reward matches manual mean | result.avg_reward == mean([e.total_reward for e in result.episodes if e.error is None]) |
| 7 | Verify determinism | Repeat with same seed | Identical results |
Run: uv run pytest tests/integration/test_evaluation_integration.py -v -k "full_evaluation"
Flow: Evaluation with Partial Failures
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Create environment and a policy that fails on specific episodes | Setup complete | -- |
| 2 | Call evaluate(env, flaky_policy, n_episodes=5) | Returns result with mix of successes and failures | result.n_completed < result.n_episodes |
| 3 | Inspect failed episodes | Have error field set | any(e.error is not None for e in result.episodes) |
| 4 | Inspect successful episodes | Have error=None | Completed episodes have error is None and valid metrics |
Run: uv run pytest tests/integration/test_evaluation_integration.py -v -k "partial_failure"
Flow: Zero Episodes
| Step | Action | Expected | Verification |
|---|---|---|---|
| 1 | Call evaluate(env, policy, n_episodes=0) | Returns zero-state result | All aggregate values are 0.0, episodes list is empty |
Run: uv run pytest tests/integration/test_evaluation_integration.py -v -k "zero_episodes"
3. API Tests
No API endpoints defined for F005. This section is intentionally empty.
4. E2E Tests
Scenario: Single-Command Evaluation of Random Baseline
Setup: SQLEnvironment initialized with Spider-format test database and questions file.
Actions: Call evaluate(env, RandomPolicy(seed=42), n_episodes=20, seed=0) and inspect output.
Expected:
- Returns EvaluationResult with n_episodes=20
- success_rate is a float in [0.0, 1.0]
- avg_reward is a float
- avg_steps is a positive float
- n_completed == 20 (no errors with valid env + RandomPolicy)
- All 20 EpisodeResult entries present with valid fields
- Deterministic: re-running with same seeds yields identical results
Run: uv run pytest tests/e2e/test_evaluation_e2e.py -v
Scenario: Comparison of Two Policies
Setup: SQLEnvironment with test data.
Actions:
- Evaluate RandomPolicy(seed=1) over 20 episodes
- Evaluate an "always answer immediately" policy over 20 episodes
- Compare results
Expected:
- Both return valid EvaluationResult
- Results are structurally comparable (same fields)
- Metrics differ between policies
Run: uv run pytest tests/e2e/test_evaluation_e2e.py -v -k "comparison"
5. Edge Cases Checklist
- n_episodes = 0 returns zero-valued EvaluationResult with empty episodes list
- n_episodes = -1 raises ValueError immediately
- n_episodes = 1 works correctly (single episode)
- All episodes fail -- n_completed=0, averages are 0.0, success_rate is 0.0
- Exception during env.reset() is caught and recorded
- Exception during policy.select_action() is caught and recorded
- Exception during env.step() is caught and recorded
- RandomPolicy with budget_remaining=1 always returns ANSWER
- RandomPolicy with budget_remaining > 1 never returns ANSWER
- Seed determinism: same seed + same n_episodes = identical EvaluationResult
- Per-episode seeding: episode i uses seed+i for env.reset()
- Progress callback receives (current, total) for each episode
- Progress callback=None does not cause errors
- EpisodeResult and EvaluationResult are frozen (immutable)
- Large n_episodes (500+) completes without memory issues
- success_rate is always in [0.0, 1.0]
- avg_reward and avg_steps computed only from completed (non-error) episodes
6. Evidence Requirements
| Category | Evidence Type | Example |
|---|---|---|
| Unit tests | pytest output | X passed from uv run pytest tests/unit/test_evaluation.py -v |
| Integration | pytest output | X passed from uv run pytest tests/integration/test_evaluation_integration.py -v |
| E2E | pytest output | X passed from uv run pytest tests/e2e/test_evaluation_e2e.py -v |
| Edge cases | pytest output | All edge-case tests in checklist pass |
| Determinism | pytest output | Seed-based tests produce identical results across runs |