# Implementation Specification

**Change:** F005 -- Green Agent Wrapper (automated evaluation)
**Date:** 2026-03-27
**Research Summary:** [specs/F005-RESEARCH_SUMMARY.md](F005-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Spec:** Archived to [specs/behavior/evaluation.md](behavior/evaluation.md)

**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed

---

## Core Intent (Immutable)

> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you're describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.

**User Problem:**
Run automated evaluation: "How does policy X perform over 100 episodes?" Single command, structured output. Enables training comparison (random vs trained).

**Success Criteria:**
- Single function call: `evaluate(n_episodes=100)` returns clean metrics dict
- Built-in random policy for instant baseline comparison
- Results include per-episode breakdown for analysis

**Avoid:**
- Evaluation crashes partway through and loses all results
- No progress indicator for long evaluation runs

**Out of Scope:**
- Visualization / plotting of results
- WebSocket / remote environment support (local SQLEnvironment only)
- Elaborate policy class hierarchy
- Training loop integration (F006 will consume this API)

---

## 0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in **small, mergeable increments**.

### Scope Budget
- Target: **2 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**

### Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.

**Each slice must have:**
- Clear outcome
- Minimal interface change
- Merge criteria

**Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
## Status Icons

**Step Status:**
- ⬜ Not Started
- 🔄 In Progress
- ✅ Completed
- 🔴 Blocked/Failed

**Result Outcome:**
- ✅ Fully Successful (all tests passed, no issues)
- ⚠️ Completed with Issues (needs follow-up)
- 🔴 Failed/Blocked

---

## 1. Implementation Overview

### Summary

Create an `evaluation/` subpackage containing the automated evaluation wrapper for SQLEnv. The package provides: (1) a `Policy` protocol defining the interface for any policy, (2) an `EpisodeResult` dataclass for per-episode metrics, (3) an `EvaluationResult` dataclass for aggregate metrics, (4) a `RandomPolicy` class as a built-in baseline, and (5) an `evaluate()` function that runs N episodes, collects results incrementally (surviving partial failures), and returns structured metrics. The module is purely additive -- no existing code is modified.

### Scope

**In Scope:**
- `evaluation/__init__.py` -- public API re-exports
- `evaluation/green_agent.py` -- Protocol, dataclasses, RandomPolicy, evaluate()
- `tests/test_evaluation.py` -- unit + integration tests

**Out of Scope:**
- Modifications to `server/sql_environment.py` or `models.py`
- CLI entry point (future feature)
- Remote / WebSocket evaluation
- Plotting or visualization

---

## 1a. Execution Status
<!-- Auto-updated by /autocode-next-step - do not edit manually -->

**Progress:** 4/4 steps complete
**Current Step:** All planned implementation steps are complete
**Last Updated:** 2026-03-28T00:04:03Z
**Latest Result:** Fully Successful (Step 2.2 complete)
**Blockers:** None

---

## 1b. Risk Assessment

**Risk Tier:** [x] Low | [ ] Medium | [ ] High

**High-Risk Indicators Present:** (check all that apply if tier is High)
- [ ] Touches authentication or authorization logic
- [ ] Handles payment processing or financial data
- [ ] Manages secrets, API keys, or credentials
- [ ] Processes untrusted user input (file uploads, external APIs)
- [ ] Modifies privilege/permission systems

**Security Review Required:** [ ] Yes (if High) | [x] No

**Justification:**
Pure additive feature. Client-side evaluation loop that reads from the existing environment API. No security, auth, or data mutation concerns.

---

## 2. Change Manifest

### Files to Create

| File | Purpose |
|------|---------|
| `evaluation/__init__.py` | Public API: re-exports Policy, RandomPolicy, EpisodeResult, EvaluationResult, evaluate |
| `evaluation/green_agent.py` | Core evaluation logic: Protocol, dataclasses, RandomPolicy, evaluate() |
| `tests/test_evaluation.py` | Unit tests for types + RandomPolicy, integration test with SQLEnvironment |

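The `__init__.py` row above implies a small re-export module. A plausible sketch follows; the shipped file is not quoted in this spec, so the ordering and the `__all__` list are assumptions drawn from the manifest:

```python
# Sketch of evaluation/__init__.py inferred from the manifest above;
# the actual shipped file may differ in ordering or add a docstring.
from .green_agent import (
    EpisodeResult,
    EvaluationResult,
    Policy,
    RandomPolicy,
    evaluate,
)

__all__ = ["Policy", "RandomPolicy", "EpisodeResult", "EvaluationResult", "evaluate"]
```
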
### Files to Modify

None.

### Files to Delete

None.

---

## 3. Interface Specifications

### New Types

```python
# Location: evaluation/green_agent.py

from dataclasses import dataclass
from typing import Protocol, runtime_checkable

# SQLObservation and SQLAction are the project's existing observation/action
# models (see models.py); green_agent.py imports them rather than defining them.


@runtime_checkable
class Policy(Protocol):
    """Interface for any evaluation policy.

    Any object with a select_action method matching this signature
    is a valid policy (structural subtyping / duck typing).
    """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Choose an action given an observation."""
        ...


@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode evaluation metrics."""

    episode_index: int        # 0-based episode number
    correct: bool             # Whether ANSWER action matched gold
    total_reward: float       # Cumulative reward for the episode
    steps: int                # Number of steps taken
    error: str | None = None  # Error message if episode failed


@dataclass(frozen=True)
class EvaluationResult:
    """Aggregate evaluation metrics with per-episode breakdown."""

    success_rate: float            # Fraction of correct episodes [0.0, 1.0]
    avg_reward: float              # Mean total_reward across episodes
    avg_steps: float               # Mean steps across episodes
    n_episodes: int                # Number of episodes attempted
    n_completed: int               # Episodes that ran to completion (no error)
    episodes: list[EpisodeResult]  # Per-episode breakdown
```

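Because `Policy` is a `runtime_checkable` protocol, conformance is structural rather than nominal. The following self-contained sketch illustrates the check; the stub observation/action types and `FixedAnswerPolicy` are illustrative stand-ins, not part of the spec:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class SQLObservation:  # stub standing in for the project's real model
    done: bool = False


@dataclass
class SQLAction:  # stub standing in for the project's real model
    action_type: str = "ANSWER"
    payload: str = "unknown"


@runtime_checkable
class Policy(Protocol):
    def select_action(self, observation: SQLObservation) -> SQLAction: ...


class FixedAnswerPolicy:
    """Never inherits from Policy -- structural typing makes it conform."""

    def select_action(self, observation: SQLObservation) -> SQLAction:
        return SQLAction(action_type="ANSWER", payload="unknown")


# runtime_checkable isinstance() only verifies the method exists;
# it does not validate the signature or return type.
assert isinstance(FixedAnswerPolicy(), Policy)
```
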
### New Functions

```python
# Location: evaluation/green_agent.py

from collections.abc import Callable

# SQLEnvironment comes from server/sql_environment.py (see Section 10).


class RandomPolicy:
    """Built-in random baseline policy.

    Selects random action types and arguments. Deterministic given a seed.
    """

    def __init__(self, seed: int | None = None) -> None:
        """
        Args:
            seed: Random seed for reproducibility. None = non-deterministic.
        """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Pick a random action based on current observation.

        Strategy:
        - If budget_remaining > 1: randomly choose DESCRIBE, SAMPLE, or QUERY
        - If budget_remaining == 1: always ANSWER with a random guess
        - DESCRIBE/SAMPLE: pick a random table from schema_info
        - QUERY: generate a simple SELECT * FROM <table> LIMIT 5
        - ANSWER: pick a random value from last result or "unknown"

        Args:
            observation: Current environment observation

        Returns:
            A random SQLAction
        """


def evaluate(
    env: SQLEnvironment,
    policy: Policy,
    n_episodes: int = 100,
    *,
    seed: int | None = None,
    progress_callback: Callable[[int, int], None] | None = None,
) -> EvaluationResult:
    """Run automated evaluation of a policy over multiple episodes.

    Collects results incrementally -- if an episode fails, it is recorded
    as an error and evaluation continues with the next episode.

    Args:
        env: The SQLEnvironment instance to evaluate against.
        policy: Any object satisfying the Policy protocol.
        n_episodes: Number of episodes to run (0 returns empty result).
        seed: Base seed for reproducibility. Episode i uses seed+i.
        progress_callback: Optional callback(current, total) for progress.

    Returns:
        EvaluationResult with aggregate metrics and per-episode breakdown.

    Raises:
        ValueError: If n_episodes < 0.
    """
```

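A minimal usage sketch tying the pieces together. The environment construction mirrors Section 10, so the `"..."` paths and the `tokenizer` reference are placeholders there as well; the callback cadence and print format are illustrative, not specified:

```python
# Sketch: seeded evaluation with a simple progress callback.
from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment


def print_progress(current: int, total: int) -> None:
    # Called once per attempted episode, including episodes that fail.
    if current % 10 == 0 or current == total:
        print(f"evaluated {current}/{total} episodes")


env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=100, seed=42,
                  progress_callback=print_progress)
print(f"success rate {result.success_rate:.1%} "
      f"({result.n_completed}/{result.n_episodes} episodes completed)")
```
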
| --- |
|
|
| ## 4. Data Flow |
|
|
| ### Primary Flow |
|
|
| ``` |
| 1. evaluate(env, policy, n_episodes=100, seed=42) |
| - Input: environment, policy, episode count, optional seed |
| |
| 2. For each episode i in range(n_episodes): |
| a. obs = env.reset(seed=seed+i if seed else None) |
| b. While not obs.done: |
| - action = policy.select_action(obs) |
| - obs = env.step(action) |
| - Accumulate reward |
| c. Record EpisodeResult(correct=..., total_reward=..., steps=...) |
| d. Call progress_callback(i+1, n_episodes) if provided |
| |
| 3. Aggregate results: |
| - success_rate = sum(correct) / n_completed |
| - avg_reward = mean(total_reward) across completed |
| - avg_steps = mean(steps) across completed |
| |
| 4. Return EvaluationResult |
| ``` |
|
|
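One subtlety in step 2a: `seed=0` is falsy in Python, so the derivation must compare against `None` rather than relying on truthiness. A minimal sketch (the helper name is illustrative):

```python
# Sketch: per-episode seed derivation. Testing `is not None` (rather than
# truthiness) keeps seed=0 deterministic instead of silently disabling it.
def episode_seed(base: int | None, i: int) -> int | None:
    return base + i if base is not None else None


assert episode_seed(0, 3) == 3        # seed=0 is still honored
assert episode_seed(None, 3) is None  # unseeded stays non-deterministic
```
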
### Alternative Flows

**When n_episodes=0:**
```
1. Return EvaluationResult(success_rate=0.0, avg_reward=0.0,
   avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[])
```

**When episode raises exception:**
```
1. Catch exception in the episode loop
2. Record EpisodeResult(correct=False, total_reward=0.0, steps=0,
   error=str(exception))
3. Continue to next episode
```

**When env.reset() fails:**
```
1. Catch exception
2. Record EpisodeResult with error, steps=0
3. Continue to next episode
```

---

## 5. Error Handling

### Error Types

| Error | When | Handling |
|-------|------|----------|
| `ValueError` | `n_episodes < 0` | Raise immediately |
| `Exception` during `env.reset()` | DB not found, bad questions file | Catch, record as failed episode, continue |
| `Exception` during `policy.select_action()` | Policy bug | Catch, record as failed episode, continue |
| `Exception` during `env.step()` | Environment bug | Catch, record as failed episode, continue |

### Error Handling Strategy

```python
# Pattern: incremental collection with per-episode error isolation
episodes: list[EpisodeResult] = []
for i in range(n_episodes):
    episode_seed = seed + i if seed is not None else None
    try:
        obs = env.reset(seed=episode_seed)
        total_reward = 0.0
        steps = 0
        while not obs.done:
            action = policy.select_action(obs)
            obs = env.step(action)
            total_reward += obs.reward or 0.0
            steps += 1
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=_check_correct(obs),
            total_reward=total_reward,
            steps=steps,
        ))
    except Exception as exc:
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=False,
            total_reward=0.0,
            steps=0,
            error=str(exc),
        ))
```

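The aggregation that follows the loop is not quoted in this spec; a sketch consistent with the Section 4 data flow (completed episodes only, zero-safe division) might look like:

```python
# Continuation of the loop above (a sketch, not the shipped implementation):
# aggregate over completed episodes only and guard the empty case so the
# division never raises ZeroDivisionError.
completed = [e for e in episodes if e.error is None]
n_completed = len(completed)
n = n_completed or 1  # sums are 0 when empty, so aggregates stay 0.0
return EvaluationResult(
    success_rate=sum(e.correct for e in completed) / n,
    avg_reward=sum(e.total_reward for e in completed) / n,
    avg_steps=sum(e.steps for e in completed) / n,
    n_episodes=n_episodes,
    n_completed=n_completed,
    episodes=episodes,
)
```
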
### Retry Strategy

| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Episode evaluation | No | Record error, move to next episode |
| Environment reset | No | Record error, move to next episode |

---

## 6. Slice Plan (What we will ship, in order)

### Slice S1 -- Types, Protocol, and RandomPolicy
**Value:** Establishes the evaluation interface and provides a usable random baseline
**User-visible change:** Yes -- users can instantiate RandomPolicy and call select_action
**Interfaces introduced/changed:** Policy protocol, EpisodeResult, EvaluationResult, RandomPolicy
**Rollback safety:** Purely additive -- new files only, no changes to existing code

### Slice S2 -- evaluate() Function and Integration Test
**Value:** Users can run `evaluate(env, random_policy, n_episodes=100)` and get structured metrics
**User-visible change:** Yes -- the core capability is now available
**Interfaces introduced/changed:** evaluate() function
**Rollback safety:** Purely additive -- extends S1 files, no changes to existing code

---

## 7. Implementation Steps

> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.

### Step 1.1: Types and Protocol
**Slice:** S1
**Goal:** Define the Policy protocol, EpisodeResult dataclass, and EvaluationResult dataclass.

**Files:**
- `evaluation/__init__.py` - create - empty init with re-exports
- `evaluation/green_agent.py` - create - Protocol + dataclasses (no functions yet)

**Interface Changes:**
- New: `Policy` protocol with `select_action(observation: SQLObservation) -> SQLAction`
- New: `EpisodeResult` frozen dataclass
- New: `EvaluationResult` frozen dataclass

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-27T23:51:09Z
**Changes Made:**
- Created `evaluation/__init__.py` with public re-exports for `Policy`, `EpisodeResult`, and `EvaluationResult`.
- Created `evaluation/green_agent.py` with the `Policy` runtime-checkable protocol and frozen `EpisodeResult`/`EvaluationResult` dataclasses.

**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/ -v
  Result: 100 passed, 1 skipped
  Scope: full project regression run after adding new evaluation types
  ```
- **Tests run:** `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - Dataclass and protocol scaffolding is additive and isolated to a new package.
  - `pytest` is not installed in the project environment yet, so verification used `uv run --with pytest` for this step.
  - Import fallback mirrors existing package-vs-standalone test collection behavior in the repo.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- Types are defined and importable from `evaluation`

---

### Step 1.2: RandomPolicy Implementation
**Slice:** S1
**Goal:** Implement the RandomPolicy class that selects random actions based on observation state.

**Files:**
- `evaluation/green_agent.py` - modify - add RandomPolicy class

**Interface Changes:**
- New: `RandomPolicy.__init__(seed: int | None = None)`
- New: `RandomPolicy.select_action(observation: SQLObservation) -> SQLAction`

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-27T23:55:10Z
**Changes Made:**
- Implemented `RandomPolicy` in `evaluation/green_agent.py` with seed-controlled randomness, budget-aware action selection, schema table parsing, and row-based answer candidate extraction.
- Updated `evaluation/__init__.py` to re-export `RandomPolicy` from the public evaluation API.
- Added `tests/test_evaluation.py` with focused RandomPolicy behavior tests (exploration vs answer mode, determinism, action type coverage, and answer extraction).

**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 6 passed
  Scope: RandomPolicy unit coverage for F005 Step 1.2

  Command: uv run --with pytest pytest tests/ -v
  Result: 106 passed, 1 skipped
  Scope: Full regression after RandomPolicy implementation
  ```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - RandomPolicy always explores with DESCRIBE/SAMPLE/QUERY while budget remains and forces ANSWER on the last step.
  - Schema parsing intentionally handles both `- table` and `- table: columns...` observation formats.
  - Verification commands in the spec referenced `tests/unit/...`; this repo uses a flat `tests/` layout, so tests were added in `tests/test_evaluation.py`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- RandomPolicy is implemented and exported from the public `evaluation` API
- Ready to implement `evaluate()` using per-episode loop and error isolation

---

### Step 2.1: evaluate() Function
**Slice:** S2
**Goal:** Implement the core evaluate() function with incremental collection and error isolation.

**Files:**
- `evaluation/green_agent.py` - modify - add evaluate() function
- `evaluation/__init__.py` - modify - add evaluate to re-exports

**Interface Changes:**
- New: `evaluate(env, policy, n_episodes, *, seed, progress_callback) -> EvaluationResult`

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-27T23:59:28Z
**Changes Made:**
- Added `evaluate()` to `evaluation/green_agent.py` with per-episode reset/step loop, seed+i reset behavior, progress callback support, and per-episode error isolation.
- Added `evaluate` to `evaluation/__init__.py` public exports.
- Extended `tests/test_evaluation.py` with unit coverage for the evaluate() happy path, zero/negative episodes, seed propagation, exception handling, aggregate calculations, and progress callback behavior.

**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 14 passed
  Scope: RandomPolicy + evaluate() unit coverage for F005 Step 2.1

  Command: uv run --with pytest pytest tests/ -v
  Result: 114 passed, 1 skipped
  Scope: Full regression after evaluate() implementation
  ```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - evaluate() computes aggregates using completed episodes only (`error is None`), matching the error-isolation behavior in the spec data flow.
  - The progress callback is invoked once per attempted episode, including episodes that fail.
  - The repository environment still does not include pytest by default, so verification used `uv run --with pytest`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- evaluate() is implemented, exported, and covered by focused unit tests
- Next step should add/expand integration coverage with a real `SQLEnvironment` evaluation run

---

### Step 2.2: Integration Test with SQLEnvironment
**Slice:** S2
**Goal:** Write an integration test that runs evaluate() with RandomPolicy against a real SQLEnvironment.

**Files:**
- `tests/test_evaluation.py` - create - unit tests for types + RandomPolicy + evaluate(); integration test with real env

**Interface Changes:**
None (test-only step).

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-28T00:04:03Z
**Changes Made:**
- Added `_build_sql_environment()` test helper in `tests/test_evaluation.py` to spin up a real SQLite-backed `SQLEnvironment` with a deterministic question fixture.
- Added `test_evaluate_integration_with_sql_environment` validating end-to-end `evaluate()` execution over 10 episodes with aggregate-metric consistency checks.
- Added `test_evaluate_integration_is_deterministic_with_seeds` validating deterministic full-result equality when both policy and environment seeds are fixed (sketched below).

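A condensed sketch of the determinism test named above. `_build_sql_environment()` is the helper introduced in this step, but its exact signature and fixture details are assumptions here, not quoted from the repository:

```python
# Sketch only: mirrors the determinism test described in Changes Made.
from evaluation import RandomPolicy, evaluate


def test_evaluate_integration_is_deterministic_with_seeds():
    env = _build_sql_environment()  # real SQLite-backed SQLEnvironment fixture
    first = evaluate(env, RandomPolicy(seed=42), n_episodes=10, seed=42)
    second = evaluate(env, RandomPolicy(seed=42), n_episodes=10, seed=42)
    # Frozen dataclasses compare by value, so full-result equality covers
    # both the aggregates and the per-episode breakdown in one assertion.
    assert first == second
```
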
**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 16 passed
  Scope: evaluation unit + integration coverage including real SQLEnvironment flow

  Command: uv run --with pytest pytest tests/ -v
  Result: 116 passed, 1 skipped
  Scope: full project regression after adding integration coverage
  ```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - Integration tests were implemented in `tests/test_evaluation.py` to match this repository's flat test layout.
  - Verifier gate approved finalization in MVP mode after test evidence review.
  - Reviewer auto-step was skipped by policy because risk tier is Low, tests passed, and no security-sensitive surfaces changed.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- All implementation steps are complete and the verification gate passed.

---

## 8. Rollout Considerations

### Feature Flags
- [x] Required: No
- [ ] Flag name: N/A

### Migration
- [x] Data migration needed: No
- [ ] Migration strategy: N/A

### Rollback Plan
Delete the `evaluation/` directory. No other code references it.

---

## 9. Execution Tracking

All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for the summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history

---

## 9a. Slice Completion Protocol

After all steps in a slice pass verification:

1. **Run verifier subagent** for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in the slice

2. **Run compound-engineer subagent** to extract learnings
   - **Mandatory invocation** after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)

3. **Commit** the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit

4. **Continue to next slice** (if more slices remain)
   - Or proceed to final verification if all slices complete

**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.

---

## 10. User Value Summary

<!-- Populated by /autocode-next-step when final step completes -->

**Status:** Generated

### What Users Can Now Do
Run automated evaluation of any policy over N episodes with `evaluate(env, policy, n_episodes=100)` and get structured metrics including success rate, average reward, average steps, and per-episode breakdown.

### How to Access/Test
```python
from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment

env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=10, seed=42)
print(f"Success rate: {result.success_rate:.1%}")
print(f"Avg reward: {result.avg_reward:.3f}")
```

### Demo
- **Command:** `uv run python -c "from evaluation import evaluate, RandomPolicy; ..."`

### Release Notes Snippet
Added automated evaluation wrapper with built-in random baseline policy for benchmarking agent performance.

---

## 11. PR Contract (Auto-Generated by autocode-next-step)

<!-- This section is auto-populated by autocode-next-step command when all steps complete -->

**Status:** Generated

### PR Title
feat(evaluation): complete green agent wrapper integration and finalization

### PR Summary
- Add deterministic integration coverage for `evaluate()` against a real `SQLEnvironment` fixture.
- Finalize F005 with full regression evidence, verifier approval, and archived behavior documentation.
- Capture durable learnings under `docs/learnings/` for evaluation patterns and deterministic testing.

### Verification
- `uv run --with pytest pytest tests/test_evaluation.py -v`
- `uv run --with pytest pytest tests/ -v`

### Follow-up
All steps completed. PR Created: https://github.com/hjerpe/sql-env/pull/10

---

## Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.

---

## Human Checkpoint

**Before handing to AI agent:**

- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated

**Questions:**
1. Any remaining concerns?
2. Anything the agent should know?

---

## Handoff Notes

**For the implementing AI agent:**

```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
```

---

*Specification completed: 2026-03-27*
*Approved by: --*
*Verification spec: VERIFICATION_SPEC.md*
*Verification input: [F005-VERIFICATION_INPUT.json](F005-VERIFICATION_INPUT.json)*
*Target agent: Claude Code*