| # Implementation Specification |
|
|
| **Change:** F002 -- Answer Verification (multi-type comparison) |
| **Date:** 2026-03-27 |
| **Research Summary:** [specs/F002-RESEARCH_SUMMARY.md](F002-RESEARCH_SUMMARY.md) |
| **Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner) |
| **Behavior Delta:** Archived into `specs/behavior/sql-environment.md` |
| |
| **Plan Status:** |
| - [x] Draft |
| - [x] Approved for Implementation |
| - [x] Implementation Complete |
| - [x] Verification Passed |
| |
| --- |
| |
| ## Core Intent (Immutable) |
| |
| > **DO NOT MODIFY THIS SECTION DURING REFINEMENT** |
| > Changes to Core Intent mean you're describing a different feature. |
| > If refinement reveals the need to change this section, create a new feature instead. |
| |
| **User Problem:** |
| When an agent submits ANSWER, the environment correctly determines if the answer matches the gold answer regardless of type (42 vs 42.0, 'Engineering' vs 'engineering', unordered lists). |
| |
| **Success Criteria:** |
| - Float comparison with tolerance handles rounding gracefully (95000.1 matches 95000) |
| - List comparison ignores order: ['A','B'] matches ['B','A'] |
| - Clear pass/fail with no ambiguity |
| |
| **Avoid:** |
| - Correct answer rejected due to trivial formatting difference |
| - Type coercion failures (agent says '42', gold is integer 42) |
| |
| **Out of Scope:** |
| - Table comparison (multi-column row overlap) -- deferred to post-MVP |
| - Partial credit scoring -- binary pass/fail only at this layer |
| - Changes to reward signal structure (F003 scope) |
| |
| --- |
| |
| ## 0. Slicing & Scope Budget (Anti-Waterfall) |
| |
| This spec must be executable in **small, mergeable increments**. |
| |
| ### Scope Budget |
| - Target: **2 slices** |
| - Hard max: **<= 10 steps total** |
| - Each step must end in: **implement -> verify -> merge** |
| |
| ### Slice Definition |
| A slice is a vertical increment that delivers user-visible value or a safe internal capability. |
| |
| **Each slice must have:** |
| - Clear outcome |
| - Minimal interface change |
| - Merge criteria |
| |
| **Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent). |
|
|
| ## Status Icons |
|
|
| **Step Status:** |
| - Not Started |
| - In Progress |
| - Completed |
| - Blocked/Failed |
|
|
| **Result Outcome:** |
| - Fully Successful (all tests passed, no issues) |
| - Completed with Issues (needs follow-up) |
| - Failed/Blocked |
|
|
| --- |
|
|
| ## 1. Implementation Overview |
|
|
| ### Summary |
| Implement `verify_answer()` in `server/verifier.py` with type-aware comparison that dispatches across four answer types (integer, float, string, list). Wire it into `_handle_answer()` in `server/sql_environment.py`, replacing the naive string comparison. Add a `gold_rows` field to `EpisodeContext` so the verifier receives the raw rows for accurate list comparison. Fall back to string comparison when `answer_type` is missing or unrecognized. |
|
|
| ### Scope |
|
|
| **In Scope:** |
| - `verify_answer()` public function with 4 type comparers |
| - Private helpers: `_normalize_value`, `_compare_integer`, `_compare_float`, `_compare_string`, `_compare_list` |
| - `gold_rows` field on `EpisodeContext` |
| - Integration into `_handle_answer()` |
| - Unit tests for all comparers and edge cases |
|
|
| **Out of Scope:** |
| - Table comparison (multi-column) |
| - Partial credit / dense reward (F003) |
| - Changes to question data schema (answer_type already exists) |
| - External dependencies (pure Python only) |
| |
| --- |
| |
| ## 1a. Execution Status |
| <!-- Auto-updated by /autocode-next-step - do not edit manually --> |
| |
| **Progress:** 4/4 steps complete |
| **Current Step:** None (all implementation steps complete) |
| **Last Updated:** 2026-03-27T22:33:12Z |
| **Latest Result:** Fully Successful (all tests passed, no issues) |
| **Blockers:** None |
| |
| --- |
| |
| ## 1b. Risk Assessment |
| |
| **Risk Tier:** Low |
| |
| **High-Risk Indicators Present:** (none apply) |
| - [ ] Touches authentication or authorization logic |
| - [ ] Handles payment processing or financial data |
| - [ ] Manages secrets, API keys, or credentials |
| - [ ] Processes untrusted user input (file uploads, external APIs) |
| - [ ] Modifies privilege/permission systems |
| |
| **Security Review Required:** No |
| |
| **Justification:** |
| Pure logic module that compares two values. No user input beyond agent's ANSWER string (already sanitized by action parsing). No I/O, no network, no secrets. |
| |
| --- |
| |
| ## 2. Change Manifest |
| |
| ### Files to Create |
| |
| | File | Purpose | |
| |------|---------| |
| | `tests/test_verifier.py` | Unit tests for all comparison types and edge cases | |
|
|
| ### Files to Modify |
|
|
| | File | Changes | |
| |------|---------| |
| | `server/verifier.py` | Replace stub with full `verify_answer()` + private helpers | |
| | `models.py` | Add `gold_rows: list[tuple] | None = None` to `EpisodeContext` | |
| | `server/sql_environment.py` | Wire `verify_answer()` into `_handle_answer()`, populate `gold_rows` | |
|
|
| ### Files to Delete |
|
|
| None. |
|
|
| --- |
|
|
| ## 3. Interface Specifications |
|
|
| ### Modified Types |
|
|
| ```python |
| # Location: models.py |
| # CHANGE: Add gold_rows field to EpisodeContext |
| |
| @dataclass |
| class EpisodeContext: |
|     """Per-episode server-side state (never sent to agent).""" |
| |
|     episode_id: str |
|     db_connection: sqlite3.Connection |
|     question_record: QuestionRecord |
|     step_count: int = 0 |
|     budget: int = 15 |
|     described_tables: set[str] = dataclass_field(default_factory=set) |
|     action_log: list[str] = dataclass_field(default_factory=list) |
|     done: bool = False |
|     gold_answer: str | None = None |
|     gold_rows: list[tuple] | None = None  # NEW: raw SQL result rows for verifier |
| ``` |
|
|
| ### New Functions |
|
|
| ```python |
| # Location: server/verifier.py |
| |
| def verify_answer( |
|     predicted: str, |
|     gold: str, |
|     answer_type: str | None = None, |
|     gold_rows: list[tuple] | None = None, |
| ) -> bool: |
|     """ |
|     Compare agent's submitted answer against the gold answer. |
| |
|     Dispatches to type-specific comparers based on answer_type. |
|     Falls back to string comparison when answer_type is None or unknown. |
| |
|     Args: |
|         predicted: The agent's submitted answer string. |
|         gold: The gold answer as a formatted string. |
|         answer_type: One of "integer", "float", "string", "list", or None. |
|         gold_rows: Raw SQL result rows (list of tuples) for accurate list comparison. |
| |
|     Returns: |
|         True if the answer is correct, False otherwise. |
|     """ |
| ``` |
|
|
| ```python |
| # Location: server/verifier.py (private helpers) |
| |
| def _normalize_value(value: str) -> str: |
|     """Strip whitespace and lowercase a value for comparison.""" |
| |
| def _compare_integer(predicted: str, gold: str) -> bool: |
|     """ |
|     Compare as integers after coercing both sides. |
| |
|     Handles: "42" vs 42, "42.0" vs 42. |
|     Returns False on ValueError (non-numeric input). |
|     """ |
| |
| def _compare_float(predicted: str, gold: str, tolerance: float = 0.01) -> bool: |
|     """ |
|     Compare as floats with relative tolerance (default 1%). |
| |
|     Uses: abs(pred - gold) <= tolerance * abs(gold) when gold != 0. |
|     For gold == 0: uses absolute tolerance of 1e-9. |
|     Returns False on ValueError. |
|     """ |
| |
| def _compare_string(predicted: str, gold: str) -> bool: |
|     """Case-insensitive, whitespace-normalized string comparison.""" |
| |
| def _compare_list( |
|     predicted: str, |
|     gold: str, |
|     gold_rows: list[tuple] | None = None, |
| ) -> bool: |
|     """ |
|     Order-insensitive set comparison. |
| |
|     Predicted is split on ',' and newlines, then normalized into a set. |
|     If gold_rows is provided, the gold set is built from its cells. |
|     Otherwise the formatted gold string is split on ' | ' and newlines. |
|     """ |
| ``` |
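The tolerance rule in the `_compare_float` docstring can be checked by hand on the success-criteria example: with gold 95000 and tolerance 0.01 the allowed deviation is 950, so 95000.1 (off by 0.1) passes. The following is a minimal sketch of that rule, assuming it is applied exactly as the docstring states; `compare_float_sketch` is an illustrative name, not the shipped helper.

```python
def compare_float_sketch(predicted: str, gold: str, tolerance: float = 0.01) -> bool:
    """Illustrative stand-in for _compare_float; follows the docstring's rule."""
    try:
        p, g = float(predicted), float(gold)
    except (ValueError, TypeError):
        return False  # non-numeric input counts as a wrong answer
    if g == 0:
        return abs(p) <= 1e-9  # absolute tolerance when gold is zero
    return abs(p - g) <= tolerance * abs(g)  # relative tolerance otherwise


print(compare_float_sketch("95000.1", "95000"))  # deviation 0.1, allowed 950
print(compare_float_sketch("100", "200"))        # deviation 100, allowed 2
print(compare_float_sketch("0", "0"))
print(compare_float_sketch("abc", "1.0"))
```

Because the tolerance is relative to the gold value, the accepted band widens as the gold value grows, which is why the zero case needs its own absolute threshold.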
|
|
| ### Modified Functions |
|
|
| ```python |
| # Location: server/sql_environment.py |
| # CHANGE: Replace naive comparison with verify_answer() call |
| |
| def _handle_answer(self, value: str) -> tuple[bool, float]: |
|     """Compare submitted answer against episode gold answer using type-aware verifier.""" |
|     if self._episode is None: |
|         raise RuntimeError("No active episode. Call reset() before step().") |
| |
|     is_correct = verify_answer( |
|         predicted=value, |
|         gold=self._episode.gold_answer or "", |
|         answer_type=self._episode.question_record.answer_type, |
|         gold_rows=self._episode.gold_rows, |
|     ) |
|     self._episode.done = True |
|     return is_correct, 1.0 if is_correct else 0.0 |
| ``` |
|
|
| --- |
|
|
| ## 4. Data Flow |
|
|
| ### Primary Flow |
|
|
| ``` |
| 1. Agent sends ANSWER action with value string |
| - Input: action.argument (str) |
| |
| 2. step() dispatches to _handle_answer(value) |
| - Input: value (str) |
| |
| 3. _handle_answer() calls verify_answer(predicted, gold, answer_type, gold_rows) |
| - predicted: value (agent's answer) |
| - gold: self._episode.gold_answer (formatted string) |
| - answer_type: self._episode.question_record.answer_type |
| - gold_rows: self._episode.gold_rows (raw tuples or None) |
| |
| 4. verify_answer() dispatches by answer_type: |
| - "integer" -> _compare_integer(predicted, gold) |
| - "float" -> _compare_float(predicted, gold) |
| - "string" -> _compare_string(predicted, gold) |
| - "list" -> _compare_list(predicted, gold, gold_rows) |
| - None/unknown -> _compare_string(predicted, gold) |
| |
| 5. Returns bool -> _handle_answer returns (bool, float reward) |
| ``` |
|
|
| ### Alternative Flows |
|
|
| **When answer_type is None or unknown:** |
| ``` |
| 1. verify_answer receives answer_type=None |
| 2. Falls back to _compare_string(predicted, gold) |
| 3. Returns bool (case-insensitive normalized comparison) |
| ``` |
| |
| **When predicted or gold is empty/None:** |
| ``` |
| 1. verify_answer receives empty string or None-coerced value |
| 2. Returns False immediately (no valid answer to compare) |
| ``` |
| |
| **When type coercion fails (e.g., "abc" as integer):** |
| ``` |
| 1. _compare_integer or _compare_float catches ValueError |
| 2. Falls back to returning False |
| ``` |
|
|
| --- |
|
|
| ## 5. Error Handling |
|
|
| ### Error Types |
|
|
| | Error | When | Behavior | |
| |-------|------|----------| |
| | `ValueError` (caught internally) | Predicted value cannot be coerced to int/float | Return False (not correct) | |
| | `RuntimeError` | `_handle_answer` called with no active episode | Raised to caller (existing behavior) | |
|
|
| ### Error Handling Strategy |
|
|
| ```python |
| # Pattern: catch coercion errors, return False (answer is wrong, not a crash) |
| def _compare_integer(predicted: str, gold: str) -> bool: |
|     try: |
|         return int(float(predicted)) == int(float(gold)) |
|     # OverflowError covers "inf", which float() accepts but int() rejects. |
|     except (ValueError, TypeError, OverflowError): |
|         return False |
| ``` |
|
|
| ### Retry Strategy |
|
|
| | Operation | Retry? | Strategy | |
| |-----------|--------|----------| |
| | `verify_answer()` | No | Deterministic comparison, no transient failures | |
|
|
| --- |
|
|
| ## 6. Slice Plan (What we will ship, in order) |
|
|
| ### Slice S1 -- Core Verifier Module |
| **Value:** `verify_answer()` exists as a tested, standalone module with all 4 type comparers |
| **User-visible change:** No (not yet wired in) |
| **Interfaces introduced/changed:** `verify_answer()`, `_normalize_value()`, `_compare_integer()`, `_compare_float()`, `_compare_string()`, `_compare_list()` |
| **Rollback safety:** Additive only -- fills in the `server/verifier.py` stub and adds a new test file; no existing behavior changes |
|
|
| ### Slice S2 -- Integration and Wiring |
| **Value:** `_handle_answer()` uses type-aware verification; agents get correct results for float/list/integer answers |
| **User-visible change:** Yes -- agent answers previously rejected (e.g., "42" vs integer 42) now accepted |
| **Interfaces introduced/changed:** `EpisodeContext.gold_rows`, modified `_handle_answer()` |
| **Rollback safety:** Revert to naive string compare by removing import and restoring 3 lines |
|
|
| --- |
|
|
| ## 7. Implementation Steps |
|
|
| > **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md. |
| > The verification-planner (separate agent) generated independent test criteria. |
| > Run the tests specified there after implementing each step. |
| |
| ### Step 1.1: Implement verify_answer module |
| **Slice:** S1 |
| **Goal:** Create the complete `verify_answer()` function with all 4 type-specific comparers in `server/verifier.py`. |
|
|
| **Files:** |
| - `server/verifier.py` - modify - Replace stub with full implementation |
|
|
| **Interface Changes:** |
| - New public function: `verify_answer(predicted, gold, answer_type, gold_rows) -> bool` |
| - New private helpers: `_normalize_value`, `_compare_integer`, `_compare_float`, `_compare_string`, `_compare_list` |
|
|
| **Implementation Details:** |
| 1. Replace the docstring-only stub in `server/verifier.py` with the full module. |
| 2. `verify_answer()` uses match/case on `answer_type` to dispatch. |
| 3. `_normalize_value(value)`: `value.strip().lower()`. |
| 4. `_compare_integer(pred, gold)`: coerce both via `int(float(x))`, exact match. Catch ValueError -> False. |
| 5. `_compare_float(pred, gold, tolerance=0.01)`: relative tolerance `abs(p - g) <= tol * abs(g)`. For g==0, absolute tolerance 1e-9. Catch ValueError -> False. |
| 6. `_compare_string(pred, gold)`: `_normalize_value(pred) == _normalize_value(gold)`. |
| 7. `_compare_list(pred, gold, gold_rows)`: If `gold_rows` is provided, build gold set from `{str(cell) for row in gold_rows for cell in row}`. Parse predicted by splitting on `,` and `\n`. Normalize both sides, compare as sets. If no `gold_rows`, parse gold string by splitting on ` | ` and `\n`. |
| 8. Guard: if `predicted` is empty after strip, return False immediately. |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Status:** Completed |
|
|
| <!-- Filled by /autocode-next-step after implementation --> |
| **Completed:** 2026-03-27T22:18:15Z |
| **Changes Made:** |
| - `server/verifier.py` - replaced stub content with `verify_answer()` and helper comparers for integer, float, string, and list handling. |
|
|
| **Result:** |
| - **Outcome:** Fully Successful |
| - **Evidence Captured:** |
| ``` |
| uv run --extra dev pytest tests/ -v |
| ======================== 25 passed in 81.43s ========================= |
| ``` |
| - **Tests run:** `uv run --extra dev pytest tests/ -v` |
| - **Notes:** |
| - Implemented `verify_answer()` dispatch with fallback to normalized string comparison for unknown or missing answer types. |
| - Added deterministic helper behavior: integer coercion via `int(float(x))`, float relative tolerance (1%), and list set comparison. |
| - Used `uv run --extra dev` because local environment did not yet include pytest from dev extras. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Add `tests/test_verifier.py` coverage for dispatcher paths, comparer edge cases, and fallback logic from `specs/F002-VERIFICATION_SPEC.md`. |
|
|
| --- |
|
|
| ### Step 1.2: Unit tests for verifier |
| **Slice:** S1 |
| **Goal:** Create comprehensive unit tests covering all 4 answer types, edge cases, and the fallback path. |
|
|
| **Files:** |
| - `tests/test_verifier.py` - create - Unit tests for verify_answer and all comparers |
| |
| **Interface Changes:** None (test-only) |
| |
| **Implementation Details:** |
| 1. Test `_compare_integer`: "42" vs "42", "42.0" vs "42", "abc" vs "42" (False), "" vs "42" (False). |
| 2. Test `_compare_float`: "95000.1" vs "95000" (True, within 1%), "100" vs "200" (False), "0" vs "0" (True), "abc" vs "1.0" (False). |
| 3. Test `_compare_string`: "Engineering" vs "engineering" (True), " hello " vs "hello" (True), "a" vs "b" (False). |
| 4. Test `_compare_list`: "A, B" vs "B, A" (True), "A" vs "A, B" (False), test with gold_rows provided. |
| 5. Test `verify_answer` dispatch: each type routes correctly, None/unknown falls back to string. |
| 6. Test edge cases: empty predicted (False), None gold coerced to "" (False). |
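The cases listed above lend themselves to a table-driven layout. The sketch below shows the idea for the integer cases in item 1 using plain Python; in `tests/test_verifier.py` the comparer would be imported from `server.verifier`, whereas here an inline stand-in (`compare_integer_standin`, a hypothetical name) is defined so the example runs on its own. The `run_cases` helper is also illustrative.

```python
def compare_integer_standin(predicted: str, gold: str) -> bool:
    """Inline stand-in following the int(float(x)) coercion pattern."""
    try:
        return int(float(predicted)) == int(float(gold))
    except (ValueError, TypeError, OverflowError):
        return False


# (predicted, gold, expected) rows taken from item 1 of the list above.
INTEGER_CASES = [
    ("42", "42", True),
    ("42.0", "42", True),
    ("abc", "42", False),
    ("", "42", False),
]


def run_cases(compare, cases):
    """Return the cases whose actual result differs from the expected one."""
    return [(p, g, e) for p, g, e in cases if compare(p, g) is not e]


failures = run_cases(compare_integer_standin, INTEGER_CASES)
print("failures:", failures)
```

Keeping the cases in a data table makes it cheap to add a row per edge case without writing a new test function each time.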
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Status:** Completed |
|
|
| <!-- Filled by /autocode-next-step after implementation --> |
| **Completed:** 2026-03-27T22:21:30Z |
| **Changes Made:** |
| - `tests/test_verifier.py` - created comprehensive unit coverage for verifier dispatch and helper comparers across integer, float, string, and list cases. |
|
|
| **Result:** |
| - **Outcome:** Fully Successful |
| - **Evidence Captured:** |
| ``` |
| uv run pytest tests/test_verifier.py -v |
| ============================== 31 passed in 6.19s ============================== |
| ``` |
| - **Tests run:** `uv run pytest tests/test_verifier.py -v` |
| - **Notes:** |
| - Added dispatcher tests for all answer types plus fallback and empty-predicted guards. |
| - Added comparer edge-case tests (int truncation, float tolerance boundaries, list parsing with/without `gold_rows`). |
| - Kept coverage aligned to existing verifier behavior (normalized whitespace/case comparison). |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Add `gold_rows` to `EpisodeContext` in `models.py` and persist raw gold query rows during `reset()` in `server/sql_environment.py`. |
|
|
| --- |
|
|
| ### Step 2.1: Add gold_rows to EpisodeContext and populate during reset |
| **Slice:** S2 |
| **Goal:** Add `gold_rows` field to `EpisodeContext` and populate it when an episode is reset (alongside `gold_answer`). |
| |
| **Files:** |
| - `models.py` - modify - Add `gold_rows: list[tuple] | None = None` to EpisodeContext |
| - `server/sql_environment.py` - modify - Populate `gold_rows` during episode reset where `gold_answer` is set |
|
|
| **Interface Changes:** |
| - `EpisodeContext.gold_rows: list[tuple] | None = None` (new field) |
|
|
| **Implementation Details:** |
| 1. Add `gold_rows: list[tuple] | None = None` to `EpisodeContext` dataclass after `gold_answer`. |
| 2. In `sql_environment.py`, find where `gold_answer` is populated during `reset()`. At the same location, store the raw rows in `gold_rows` before they are formatted. |
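A minimal sketch of item 2, assuming the gold SQL is executed once and the formatted answer is derived from the same rows. The helper name `run_gold_query`, the `dept` table, and the exact formatting (cells joined by `" | "`, rows joined by newlines) are assumptions for illustration; the real `reset()` code may differ.

```python
import sqlite3


def run_gold_query(conn: sqlite3.Connection, gold_sql: str) -> tuple[list[tuple], str]:
    """Execute the gold SQL once; return raw rows plus a formatted string."""
    rows = conn.execute(gold_sql).fetchall()
    # Assumed formatting: cells joined by " | ", rows joined by newlines.
    formatted = "\n".join(" | ".join(str(cell) for cell in row) for row in rows)
    return rows, formatted


# Hypothetical in-memory database standing in for an episode's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (name TEXT)")
conn.executemany("INSERT INTO dept VALUES (?)", [("Engineering",), ("Sales",)])

gold_rows, gold_answer = run_gold_query(conn, "SELECT name FROM dept ORDER BY name")
print(gold_rows)    # raw tuples, suitable for EpisodeContext.gold_rows
print(gold_answer)  # formatted string, suitable for EpisodeContext.gold_answer
conn.close()
```

Capturing the rows before formatting means the verifier does not have to re-parse a string whose cell values might themselves contain the separator.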
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Status:** Completed |
|
|
| <!-- Filled by /autocode-next-step after implementation --> |
| **Completed:** 2026-03-27T22:24:54Z |
| **Changes Made:** |
| - `models.py` - added `gold_rows: list[tuple] | None = None` to `EpisodeContext`. |
| - `server/sql_environment.py` - persisted raw gold query rows into `EpisodeContext.gold_rows` during `reset()`. |
| - `tests/test_verifier.py` - added `EpisodeContext.gold_rows` unit tests (default `None`, populated list, empty list). |
|
|
| **Result:** |
| - **Outcome:** Fully Successful |
| - **Evidence Captured:** |
| ``` |
| uv run pytest tests/test_verifier.py -v |
| ============================== 34 passed in 6.18s ============================== |
| ``` |
| - **Tests run:** `uv run pytest tests/test_verifier.py -v` |
| - **Notes:** |
| - Stored structured `gold_rows` at reset-time where gold SQL is already executed, so no extra SQL execution path was introduced. |
| - Added direct dataclass tests for `EpisodeContext.gold_rows` to satisfy verification criteria for the new interface field. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Replace `_handle_answer()` naive normalized string equality with `verify_answer(predicted, gold, answer_type, gold_rows)` and keep terminal reward mapping unchanged. |
|
|
| --- |
|
|
| ### Step 2.2: Wire verify_answer into _handle_answer |
| **Slice:** S2 |
| **Goal:** Replace naive string comparison in `_handle_answer()` with `verify_answer()` call. |
|
|
| **Files:** |
| - `server/sql_environment.py` - modify - Import and call `verify_answer()` in `_handle_answer()` |
|
|
| **Interface Changes:** |
| - Modified function: `_handle_answer()` now delegates to `verify_answer()` |
|
|
| **Implementation Details:** |
| 1. Add import: `from server.verifier import verify_answer` at top of `sql_environment.py`. |
| 2. Replace the body of `_handle_answer()`: |
| - Remove: `submitted = value.strip().lower()` / `expected = ...` / `is_correct = submitted == expected` |
| - Add: `is_correct = verify_answer(predicted=value, gold=self._episode.gold_answer or "", answer_type=self._episode.question_record.answer_type, gold_rows=self._episode.gold_rows)` |
| 3. Keep: `self._episode.done = True` and `return is_correct, 1.0 if is_correct else 0.0` |
| 4. Run existing smoke tests to confirm no regressions. |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
| - [x] Existing 25 smoke tests still pass |
|
|
| **Status:** Completed |
|
|
| <!-- Filled by /autocode-next-step after implementation --> |
| **Completed:** 2026-03-27T22:33:12Z |
| **Changes Made:** |
| - `server/sql_environment.py` - imported `verify_answer` and replaced `_handle_answer()` naive normalized-string equality with `verify_answer(predicted, gold, answer_type, gold_rows)`. |
| - `tests/test_verifier_integration.py` - added integration coverage for integer/float/string/list answer flows, fallback behavior for missing `answer_type`, and numeric coercion failure path. |
|
|
| **Result:** |
| - **Outcome:** Fully Successful |
| - **Evidence Captured:** |
| ``` |
| uv run pytest tests/test_verifier.py -v |
| ============================== 34 passed in 6.64s ============================== |
| |
| uv run pytest tests/test_smoke.py -v |
| ============================== 25 passed in 6.53s ============================== |
| |
| uv run pytest tests/test_verifier_integration.py -v |
| ============================== 6 passed in 6.65s ============================== |
| |
| uv run pytest tests/ -v |
| ============================== 65 passed in 6.62s ============================== |
| ``` |
| - **Tests run:** `uv run pytest tests/test_verifier.py -v`; `uv run pytest tests/test_smoke.py -v`; `uv run pytest tests/test_verifier_integration.py -v`; `uv run pytest tests/ -v` |
| - **Notes:** |
| - `_handle_answer()` now uses a single verifier dispatch path, keeping answer comparison logic centralized in `server/verifier.py`. |
| - Added integration tests because `VERIFICATION_SPEC.md` expected `tests/test_verifier_integration.py` evidence. |
| - Behavior delta was archived into `specs/behavior/sql-environment.md` and the delta file was removed. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Implementation complete. Proceed with commit/PR workflow (`/commit-push-pr`) for F002. |
|
|
| --- |
|
|
| ## 8. Rollout Considerations |
|
|
| ### Feature Flags |
| - [x] Required: No |
| - [ ] Flag name: N/A |
|
|
| ### Migration |
| - [x] Data migration needed: No |
| - [ ] Migration strategy: N/A |
|
|
| ### Rollback Plan |
| Revert `_handle_answer()` to inline string comparison (3 lines). The `verify_answer()` module and `gold_rows` field are additive and harmless if unused. |
|
|
| --- |
|
|
| ## 9. Execution Tracking |
|
|
| All execution state is tracked within this document: |
| - **Section 1a:** Overall progress summary |
| - **Section 7:** Per-step completion details, test results, and handoff context |
| - **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run` |
| - **Git history:** Full audit trail of changes to this file |
|
|
| The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by: |
| - Checking Section 1a for summary |
| - Reviewing Section 7 for detailed step status |
| - Inspecting the feature's `progress` and `status` fields in `FEATURES.json` |
| - Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history |
|
|
| --- |
|
|
| ## 9a. Slice Completion Protocol |
|
|
| After all steps in a slice pass verification: |
|
|
| 1. **Run verifier subagent** for spec compliance |
| - Validates against VERIFICATION_SPEC.md criteria |
| - Ensures no TODOs or incomplete work in slice |
| |
| 2. **Run compound-engineer subagent** to extract learnings |
| - **Mandatory invocation** after every slice completion |
| - Updates CLAUDE.md Learnings section (if durable patterns found) |
| - May exit with "no update needed" (valid for routine work) |
| |
| 3. **Commit** the slice changes |
| - Follow commit message format in CLAUDE.md |
| - Each slice gets its own atomic commit |
| |
| 4. **Continue to next slice** (if more slices remain) |
| - Or proceed to final verification if all slices complete |
| |
| **Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready. |
| |
| --- |
| |
| ## 10. User Value Summary |
| |
| <!-- Populated by /autocode-next-step when final step completes --> |
| |
| **Status:** Generated |
| |
| ### What Users Can Now Do |
| Users can now submit answers across integer, float, string, and list questions and get correct pass/fail outcomes even when answers differ in formatting, case, numeric representation, or list ordering. |
| |
| ### How to Access/Test |
| Run `uv run pytest tests/test_verifier.py tests/test_verifier_integration.py -v`, or run `uv run pytest tests/ -v` for full regression coverage including end-to-end ANSWER handling through `SQLEnvironment.step()`. |
|
|
| ### Demo |
| - **Command:** `uv run pytest tests/test_verifier_integration.py -v` |
|
|
| ### Release Notes Snippet |
| Added type-aware answer verification so ANSWER correctness now supports numeric coercion, float tolerance, case-insensitive strings, and order-insensitive list matching. |
|
|
| --- |
|
|
| ## 11. PR Contract (Auto-Generated by autocode-next-step) |
|
|
| <!-- This section is auto-populated by autocode-next-step command when all steps complete --> |
|
|
| **Status:** Generated |
|
|
| ### Summary |
| - Implemented type-aware answer verification in environment answer handling by routing `_handle_answer()` through `verify_answer()`. |
| - Added integration coverage for typed answer paths and fallback behavior (`tests/test_verifier_integration.py`). |
| - Archived F002 behavior delta into `specs/behavior/sql-environment.md` and captured durable learnings in `docs/learnings/F002-*.md`. |
|
|
| ### Validation |
| - `uv run pytest tests/test_verifier.py -v` -> 34 passed |
| - `uv run pytest tests/test_smoke.py -v` -> 25 passed |
| - `uv run pytest tests/test_verifier_integration.py -v` -> 6 passed |
| - `uv run pytest tests/ -v` -> 65 passed |
|
|
| ### Scope and Risk |
| - Risk tier: Low |
| - Security-sensitive changes: None |
| - Scope creep: None (added integration tests to satisfy verification spec evidence requirements) |
|
|
| ### Ready Action |
| All steps completed. Run `/commit-push-pr`. |
|
|
| ### PR Created |
| https://github.com/hjerpe/sql-env/pull/7 |
|
|
| --- |
|
|
| ## Stop Conditions (When to Split This Spec) |
|
|
| Stop and create a new IMPLEMENTATION_SPEC if: |
| - A step requires touching more than **3 files** in unrelated areas |
| - You need to introduce **multiple new abstractions** "just in case" |
| - Verification cannot be made targeted and concrete |
| - You discover new unknowns that change the plan materially |
| - The next slice cannot be merged safely without finishing later slices |
| |
| When splitting, ensure the current slice ends in a merged, stable state. |
| |
| --- |
| |
| ## Human Checkpoint |
| |
| **Before handing to AI agent:** |
| |
| - [ ] Interface specifications are complete |
| - [ ] Data flow is accurate |
| - [ ] Error handling is specified |
| - [ ] Implementation order makes sense |
| - [ ] VERIFICATION_SPEC.md has been generated |
|
|
| **Questions:** |
| 1. Should float tolerance be configurable per-question or fixed at 1%? |
| 2. Any additional answer_type values beyond the four specified? |
| |
| --- |
| |
| ## Handoff Notes |
| |
| **For the implementing AI agent:** |
| |
| ``` |
| Context: See RESEARCH_SUMMARY.md for system understanding |
| Spec: Follow this document exactly |
| Verification: Use tests from VERIFICATION_SPEC.md (independent agent) |
| Ambiguity: Stop and ask rather than assume |
| Order: Follow implementation order exactly |
| Key decisions: |
| - gold_rows passed raw to verifier (not just formatted string) |
| - Fallback to string comparison when answer_type is None/unknown |
| - No external dependencies -- pure Python only |
| - match/case dispatch, not class hierarchy |
| ``` |
| |
| --- |
| |
| *Specification completed: 2026-03-27* |
| *Approved by: [NAME/ROLE]* |
| *Verification spec: VERIFICATION_SPEC.md* |
| *Verification input: [F002-VERIFICATION_INPUT.json](F002-VERIFICATION_INPUT.json)* |
| *Target agent: Claude Code* |
|
|