	# Implementation Specification

	Change: F002 -- Answer Verification (multi-type comparison)
	Date: 2026-03-27
	Research Summary: [specs/F002-RESEARCH_SUMMARY.md](F002-RESEARCH_SUMMARY.md)
	Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
	Behavior Delta: Archived into `specs/behavior/sql-environment.md`

	Plan Status:
	- [x] Draft
	- [x] Approved for Implementation
	- [x] Implementation Complete
	- [x] Verification Passed

	---

	## Core Intent (Immutable)

	> DO NOT MODIFY THIS SECTION DURING REFINEMENT
	> Changes to Core Intent mean you're describing a different feature.
	> If refinement reveals the need to change this section, create a new feature instead.

	User Problem:
	When an agent submits ANSWER, the environment correctly determines if the answer matches the gold answer regardless of type (42 vs 42.0, 'Engineering' vs 'engineering', unordered lists).

	Success Criteria:
	- Float comparison with tolerance handles rounding gracefully (95000.1 matches 95000)
	- List comparison ignores order: ['A','B'] matches ['B','A']
	- Clear pass/fail with no ambiguity

	Avoid:
	- Correct answer rejected due to trivial formatting difference
	- Type coercion failures (agent says '42', gold is integer 42)

	Out of Scope:
	- Table comparison (multi-column row overlap) -- deferred to post-MVP
	- Partial credit scoring -- binary pass/fail only at this layer
	- Changes to reward signal structure (F003 scope)

	---

	## 0. Slicing & Scope Budget (Anti-Waterfall)

	This spec must be executable in small, mergeable increments.

	### Scope Budget
	- Target: 2 slices
	- Hard max: <= 10 steps total
	- Each step must end in: implement -> verify -> merge

	### Slice Definition
	A slice is a vertical increment that delivers user-visible value or a safe internal capability.

	Each slice must have:
	- Clear outcome
	- Minimal interface change
	- Merge criteria

	Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).

	## Status Icons

	Step Status:
	- Not Started
	- In Progress
	- Completed
	- Blocked/Failed

	Result Outcome:
	- Fully Successful (all tests passed, no issues)
	- Completed with Issues (needs follow-up)
	- Failed/Blocked

	---

	## 1. Implementation Overview

	### Summary
	Implement `verify_answer()` in `server/verifier.py` with type-aware comparison that dispatches across four answer types (integer, float, string, list). Wire it into `_handle_answer()` in `server/sql_environment.py`, replacing the naive string comparison. Add a `gold_rows` field to `EpisodeContext` so the verifier receives raw data for accurate list comparison. Fall back to string comparison when `answer_type` is missing.

	### Scope

	In Scope:
	- `verify_answer()` public function with 4 type comparers
	- Private helpers: `_normalize_value`, `_compare_integer`, `_compare_float`, `_compare_string`, `_compare_list`
	- `gold_rows` field on `EpisodeContext`
	- Integration into `_handle_answer()`
	- Unit tests for all comparers and edge cases

	Out of Scope:
	- Table comparison (multi-column)
	- Partial credit / dense reward (F003)
	- Changes to question data schema (answer_type already exists)
	- External dependencies (pure Python only)

	---

	## 1a. Execution Status
	<!-- Auto-updated by /autocode-next-step - do not edit manually -->

	Progress: 4/4 steps complete
	Current Step: None (all implementation steps complete)
	Last Updated: 2026-03-27T22:33:12Z
	Latest Result: Fully Successful (all tests passed, no issues)
	Blockers: None

	---

	## 1b. Risk Assessment

	Risk Tier: Low

	High-Risk Indicators Present: (none apply)
	- [ ] Touches authentication or authorization logic
	- [ ] Handles payment processing or financial data
	- [ ] Manages secrets, API keys, or credentials
	- [ ] Processes untrusted user input (file uploads, external APIs)
	- [ ] Modifies privilege/permission systems

	Security Review Required: No

	Justification:
	Pure logic module that compares two values. No user input beyond agent's ANSWER string (already sanitized by action parsing). No I/O, no network, no secrets.

	---

	## 2. Change Manifest

	### Files to Create

	| File | Purpose |
	|------|---------|
	| `tests/test_verifier.py` | Unit tests for all comparison types and edge cases |

	### Files to Modify

	| File | Changes |
	|------|---------|
	| `server/verifier.py` | Replace stub with full `verify_answer()` + private helpers |
	| `models.py` | Add `gold_rows: list[tuple] \| None = None` to `EpisodeContext` |
	| `server/sql_environment.py` | Wire `verify_answer()` into `_handle_answer()`, populate `gold_rows` |

	### Files to Delete

	None.

	---

	## 3. Interface Specifications

	### Modified Types

	```python
	# Location: models.py
	# CHANGE: Add gold_rows field to EpisodeContext
	import sqlite3
	from dataclasses import dataclass, field as dataclass_field

	# QuestionRecord is assumed to be defined earlier in models.py.

	@dataclass
	class EpisodeContext:
	    """Per-episode server-side state (never sent to agent)."""

	    episode_id: str
	    db_connection: sqlite3.Connection
	    question_record: QuestionRecord
	    step_count: int = 0
	    budget: int = 15
	    described_tables: set[str] = dataclass_field(default_factory=set)
	    action_log: list[str] = dataclass_field(default_factory=list)
	    done: bool = False
	    gold_answer: str | None = None
	    gold_rows: list[tuple] | None = None  # NEW: raw SQL result rows for verifier
	```

	### New Functions

	```python
	# Location: server/verifier.py

	def verify_answer(
	    predicted: str,
	    gold: str,
	    answer_type: str | None = None,
	    gold_rows: list[tuple] | None = None,
	) -> bool:
	    """
	    Compare agent's submitted answer against the gold answer.

	    Dispatches to type-specific comparers based on answer_type.
	    Falls back to string comparison when answer_type is None or unknown.

	    Args:
	        predicted: The agent's submitted answer string.
	        gold: The gold answer as a formatted string.
	        answer_type: One of "integer", "float", "string", "list", or None.
	        gold_rows: Raw SQL result rows (list of tuples) for accurate list comparison.

	    Returns:
	        True if the answer is correct, False otherwise.
	    """
	```

	```python
	# Location: server/verifier.py (private helpers)

	def _normalize_value(value: str) -> str:
	    """Strip whitespace and lowercase a value for comparison."""

	def _compare_integer(predicted: str, gold: str) -> bool:
	    """
	    Compare as integers after coercing both sides.

	    Handles: "42" vs 42, "42.0" vs 42.
	    Returns False on ValueError (non-numeric input).
	    """

	def _compare_float(predicted: str, gold: str, tolerance: float = 0.01) -> bool:
	    """
	    Compare as floats with relative tolerance (default 1%).

	    Uses: abs(pred - gold) <= tolerance * abs(gold) when gold != 0.
	    For gold == 0: uses absolute tolerance of 1e-9.
	    Returns False on ValueError.
	    """

	def _compare_string(predicted: str, gold: str) -> bool:
	    """Case-insensitive, whitespace-normalized string comparison."""

	def _compare_list(
	    predicted: str,
	    gold: str,
	    gold_rows: list[tuple] | None = None,
	) -> bool:
	    """
	    Order-insensitive set comparison.

	    If gold_rows is provided, converts both sides to sets of normalized strings.
	    Otherwise parses the formatted string (split on ' | ' and newlines).
	    """
	```
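The float rule in the `_compare_float` docstring can be checked with a small worked example. The following is a minimal sketch under the stated tolerances (1% relative, `1e-9` absolute when gold is zero); the name `compare_float_sketch` is hypothetical and this is not the code shipped in `server/verifier.py`.

```python
def compare_float_sketch(predicted: str, gold: str, tolerance: float = 0.01) -> bool:
    """Relative-tolerance float comparison, following the docstring above."""
    try:
        p = float(predicted)
        g = float(gold)
    except (ValueError, TypeError):
        # Non-numeric input counts as a wrong answer, not a crash.
        return False
    if g == 0:
        # A relative tolerance is meaningless at zero, so use an absolute one.
        return abs(p - g) <= 1e-9
    return abs(p - g) <= tolerance * abs(g)


# 95000.1 vs 95000: the difference is about 0.1, the allowed band is 0.01 * 95000 = 950.
print(compare_float_sketch("95000.1", "95000"))
# 100 vs 200: the difference is 100, the allowed band is 0.01 * 200 = 2.
print(compare_float_sketch("100", "200"))
```

Because the band scales with the gold value, a 1% tolerance is generous for large numbers (950 units at 95000) and tight for small ones, which is one reason the open question about per-question tolerance matters.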

	### Modified Functions

	```python
	# Location: server/sql_environment.py
	# CHANGE: Replace naive comparison with verify_answer() call

	def _handle_answer(self, value: str) -> tuple[bool, float]:
	    """Compare submitted answer against episode gold answer using type-aware verifier."""
	    if self._episode is None:
	        raise RuntimeError("No active episode. Call reset() before step().")

	    is_correct = verify_answer(
	        predicted=value,
	        gold=self._episode.gold_answer or "",
	        answer_type=self._episode.question_record.answer_type,
	        gold_rows=self._episode.gold_rows,
	    )
	    self._episode.done = True
	    return is_correct, 1.0 if is_correct else 0.0
	```

	---

	## 4. Data Flow

	### Primary Flow

	```
	1. Agent sends ANSWER action with value string
	   - Input: action.argument (str)

	2. step() dispatches to _handle_answer(value)
	   - Input: value (str)

	3. _handle_answer() calls verify_answer(predicted, gold, answer_type, gold_rows)
	   - predicted: value (agent's answer)
	   - gold: self._episode.gold_answer (formatted string)
	   - answer_type: self._episode.question_record.answer_type
	   - gold_rows: self._episode.gold_rows (raw tuples or None)

	4. verify_answer() dispatches by answer_type:
	   - "integer" -> _compare_integer(predicted, gold)
	   - "float" -> _compare_float(predicted, gold)
	   - "string" -> _compare_string(predicted, gold)
	   - "list" -> _compare_list(predicted, gold, gold_rows)
	   - None/unknown -> _compare_string(predicted, gold)

	5. Returns bool -> _handle_answer returns (bool, float reward)
	```

	### Alternative Flows

	When answer_type is None or unknown:
	```
	1. verify_answer receives answer_type=None
	2. Falls back to _compare_string(predicted, gold)
	3. Returns bool (case-insensitive normalized comparison)
	```

	When predicted or gold is empty/None:
	```
	1. verify_answer receives empty string or None-coerced value
	2. Returns False immediately (no valid answer to compare)
	```

	When type coercion fails (e.g., "abc" as integer):
	```
	1. _compare_integer or _compare_float catches ValueError
	2. Falls back to returning False
	```

	---

	## 5. Error Handling

	### Error Types

	| Error | When | Behavior |
	|-------|------|----------|
	| `ValueError` (caught internally) | Predicted value cannot be coerced to int/float | Return False (not correct) |
	| `RuntimeError` | `_handle_answer` called with no active episode | Raised to caller (existing behavior) |

	### Error Handling Strategy

	```python
	# Pattern: catch coercion errors, return False (answer is wrong, not a crash)
	def _compare_integer(predicted: str, gold: str) -> bool:
	    try:
	        return int(float(predicted)) == int(float(gold))
	    except (ValueError, TypeError):
	        return False
	```
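Two consequences of the `int(float(x))` pattern are worth seeing concretely. The sketch below reproduces the pattern as written and adds a hypothetical guarded variant (`compare_integer_guarded` is not part of this spec) to show one way an infinite input could be handled; whether the shipped verifier needs this is an open assumption.

```python
def compare_integer_as_specified(predicted: str, gold: str) -> bool:
    # Same pattern as the strategy block above.
    try:
        return int(float(predicted)) == int(float(gold))
    except (ValueError, TypeError):
        return False


def compare_integer_guarded(predicted: str, gold: str) -> bool:
    # Hypothetical variant: int(float("inf")) raises OverflowError,
    # which the pattern above does not catch.
    try:
        return int(float(predicted)) == int(float(gold))
    except (ValueError, TypeError, OverflowError):
        return False


# int() truncates toward zero, so "42.9" is accepted for a gold value of 42.
print(compare_integer_as_specified("42.9", "42"))  # True
# An infinite value is rejected rather than raising.
print(compare_integer_guarded("inf", "42"))        # False
```

The truncation behavior is a deliberate leniency if agents are expected to submit values like "42.0", but it also accepts "42.9"; comparing `float(predicted)` against its own truncation would be one way to tighten it if that ever becomes a problem.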

	### Retry Strategy

	| Operation | Retry? | Strategy |
	|-----------|--------|----------|
	| `verify_answer()` | No | Deterministic comparison, no transient failures |

	---

	## 6. Slice Plan (What we will ship, in order)

	### Slice S1 -- Core Verifier Module
	Value: `verify_answer()` exists as a tested, standalone module with all 4 type comparers
	User-visible change: No (not yet wired in)
	Interfaces introduced/changed: `verify_answer()`, `_normalize_value()`, `_compare_integer()`, `_compare_float()`, `_compare_string()`, `_compare_list()`
	Rollback safety: Additive only -- new file, no existing code changed

	### Slice S2 -- Integration and Wiring
	Value: `_handle_answer()` uses type-aware verification; agents get correct results for float/list/integer answers
	User-visible change: Yes -- agent answers previously rejected (e.g., "42" vs integer 42) now accepted
	Interfaces introduced/changed: `EpisodeContext.gold_rows`, modified `_handle_answer()`
	Rollback safety: Revert to naive string compare by removing import and restoring 3 lines

	---

	## 7. Implementation Steps

	> VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md.
	> The verification-planner (separate agent) generated independent test criteria.
	> Run the tests specified there after implementing each step.

	### Step 1.1: Implement verify_answer module
	Slice: S1
	Goal: Create the complete `verify_answer()` function with all 4 type-specific comparers in `server/verifier.py`.

	Files:
	- `server/verifier.py` - modify - Replace stub with full implementation

	Interface Changes:
	- New public function: `verify_answer(predicted, gold, answer_type, gold_rows) -> bool`
	- New private helpers: `_normalize_value`, `_compare_integer`, `_compare_float`, `_compare_string`, `_compare_list`

	Implementation Details:
	1. Replace the docstring-only stub in `server/verifier.py` with the full module.
	2. `verify_answer()` uses match/case on `answer_type` to dispatch.
	3. `_normalize_value(value)`: `value.strip().lower()`.
	4. `_compare_integer(pred, gold)`: coerce both via `int(float(x))`, exact match. Catch ValueError -> False.
	5. `_compare_float(pred, gold, tolerance=0.01)`: relative tolerance `abs(p - g) <= tol * abs(g)`. For g==0, absolute tolerance 1e-9. Catch ValueError -> False.
	6. `_compare_string(pred, gold)`: `_normalize_value(pred) == _normalize_value(gold)`.
	7. `_compare_list(pred, gold, gold_rows)`: If `gold_rows` is provided, build the gold set from `{str(cell) for row in gold_rows for cell in row}`. Parse predicted by splitting on `,` and `\n`. Normalize both sides, compare as sets. If no `gold_rows`, parse the gold string by splitting on ` | ` and `\n`.
	8. Guard: if `predicted` is empty after strip, return False immediately.
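The list rule in item 7 can be sketched as follows. This is a minimal illustration under the parsing rules stated above, not the shipped helper: the name `compare_list_sketch` is hypothetical, and treating an empty `gold_rows` list as "provided" is an assumption the spec does not settle.

```python
import re
from typing import List, Optional, Set, Tuple


def _normalize(value: str) -> str:
    return value.strip().lower()


def compare_list_sketch(
    predicted: str,
    gold: str,
    gold_rows: Optional[List[Tuple]] = None,
) -> bool:
    # Predicted answers are split on commas and newlines.
    predicted_set: Set[str] = {
        _normalize(item) for item in re.split(r"[,\n]", predicted) if item.strip()
    }
    if gold_rows is not None:
        # Flatten every cell of every raw row into one set.
        gold_set = {_normalize(str(cell)) for row in gold_rows for cell in row}
    else:
        # Fall back to the formatted gold string: ' | ' between cells, newlines between rows.
        gold_set = {
            _normalize(item) for item in re.split(r" \| |\n", gold) if item.strip()
        }
    return predicted_set == gold_set


print(compare_list_sketch("A, B", "B | A"))                      # True
print(compare_list_sketch("A", "A | B"))                         # False
print(compare_list_sketch("b\na", "", gold_rows=[("A",), ("B",)]))  # True
```

Because both sides are reduced to sets, duplicates are ignored: "A, A, B" matches a gold list of A and B. Cells are compared by their `str()` form, so a gold cell of `1.0` would not match a predicted "1" in this sketch.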

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	<!-- Filled by /autocode-next-step after implementation -->
	Completed: 2026-03-27T22:18:15Z
	Changes Made:
	- `server/verifier.py` - replaced stub content with `verify_answer()` and helper comparers for integer, float, string, and list handling.

	Result:
	- Outcome: Fully Successful
	- Evidence Captured:
	```
	uv run --extra dev pytest tests/ -v
	======================== 25 passed in 81.43s =========================
	```
	- Tests run: `uv run --extra dev pytest tests/ -v`
	- Notes:
	  - Implemented `verify_answer()` dispatch with fallback to normalized string comparison for unknown or missing answer types.
	  - Added deterministic helper behavior: integer coercion via `int(float(x))`, float relative tolerance (1%), and list set comparison.
	  - Used `uv run --extra dev` because the local environment did not yet include pytest from dev extras.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Add `tests/test_verifier.py` coverage for dispatcher paths, comparer edge cases, and fallback logic from `specs/F002-VERIFICATION_SPEC.md`.

	---

	### Step 1.2: Unit tests for verifier
	Slice: S1
	Goal: Create comprehensive unit tests covering all 4 answer types, edge cases, and the fallback path.

	Files:
	- `tests/test_verifier.py` - create - Unit tests for verify_answer and all comparers

	Interface Changes: None (test-only)

	Implementation Details:
	1. Test `_compare_integer`: "42" vs "42", "42.0" vs "42", "abc" vs "42" (False), "" vs "42" (False).
	2. Test `_compare_float`: "95000.1" vs "95000" (True, within 1%), "100" vs "200" (False), "0" vs "0" (True), "abc" vs "1.0" (False).
	3. Test `_compare_string`: "Engineering" vs "engineering" (True), " hello " vs "hello" (True), "a" vs "b" (False).
	4. Test `_compare_list`: "A, B" vs "B, A" (True), "A" vs "A, B" (False), test with gold_rows provided.
	5. Test `verify_answer` dispatch: each type routes correctly, None/unknown falls back to string.
	6. Test edge cases: empty predicted (False), None gold coerced to "" (False).

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	<!-- Filled by /autocode-next-step after implementation -->
	Completed: 2026-03-27T22:21:30Z
	Changes Made:
	- `tests/test_verifier.py` - created comprehensive unit coverage for verifier dispatch and helper comparers across integer, float, string, and list cases.

	Result:
	- Outcome: Fully Successful
	- Evidence Captured:
	```
	uv run pytest tests/test_verifier.py -v
	============================== 31 passed in 6.19s ==============================
	```
	- Tests run: `uv run pytest tests/test_verifier.py -v`
	- Notes:
	  - Added dispatcher tests for all answer types plus fallback and empty-predicted guards.
	  - Added comparer edge-case tests (int truncation, float tolerance boundaries, list parsing with/without `gold_rows`).
	  - Kept coverage aligned to existing verifier behavior (normalized whitespace/case comparison).
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Add `gold_rows` to `EpisodeContext` in `models.py` and persist raw gold query rows during `reset()` in `server/sql_environment.py`.

	---

	### Step 2.1: Add gold_rows to EpisodeContext and populate during reset
	Slice: S2
	Goal: Add `gold_rows` field to `EpisodeContext` and populate it when an episode is reset (alongside `gold_answer`).

	Files:
	- `models.py` - modify - Add `gold_rows: list[tuple] | None = None` to EpisodeContext
	- `server/sql_environment.py` - modify - Populate `gold_rows` during episode reset where `gold_answer` is set

	Interface Changes:
	- `EpisodeContext.gold_rows: list[tuple] | None = None` (new field)

	Implementation Details:
	1. Add `gold_rows: list[tuple] | None = None` to `EpisodeContext` dataclass after `gold_answer`.
	2. In `sql_environment.py`, find where `gold_answer` is populated during `reset()`. At the same location, store the raw rows in `gold_rows` before they are formatted.
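The two details above amount to keeping the raw rows from the gold query alongside the formatted string. The sketch below illustrates the idea with an in-memory SQLite database; the function name `run_gold_query`, the `dept` table, and the exact formatting (cells joined by ` | `, rows joined by newlines) are assumptions for illustration and are not taken from `sql_environment.py`.

```python
import sqlite3
from typing import List, Tuple


def run_gold_query(conn: sqlite3.Connection, gold_sql: str) -> Tuple[List[tuple], str]:
    """Execute the gold SQL once and return both raw rows and a formatted string."""
    rows = conn.execute(gold_sql).fetchall()  # list of tuples, one per result row
    # Assumed format: ' | ' between cells, newline between rows.
    formatted = "\n".join(" | ".join(str(cell) for cell in row) for row in rows)
    return rows, formatted


# Hypothetical data for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dept (name TEXT, headcount INTEGER)")
conn.executemany(
    "INSERT INTO dept VALUES (?, ?)", [("Engineering", 12), ("Sales", 7)]
)

gold_rows, gold_answer = run_gold_query(conn, "SELECT name FROM dept ORDER BY name")
print(gold_rows)    # [('Engineering',), ('Sales',)]
print(gold_answer)
```

Storing both values from a single execution keeps the raw tuples available to the list comparer without running the gold SQL a second time.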

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	<!-- Filled by /autocode-next-step after implementation -->
	Completed: 2026-03-27T22:24:54Z
	Changes Made:
	- `models.py` - added `gold_rows: list[tuple] | None = None` to `EpisodeContext`.
	- `server/sql_environment.py` - persisted raw gold query rows into `EpisodeContext.gold_rows` during `reset()`.
	- `tests/test_verifier.py` - added `EpisodeContext.gold_rows` unit tests (default `None`, populated list, empty list).

	Result:
	- Outcome: Fully Successful
	- Evidence Captured:
	```
	uv run pytest tests/test_verifier.py -v
	============================== 34 passed in 6.18s ==============================
	```
	- Tests run: `uv run pytest tests/test_verifier.py -v`
	- Notes:
	  - Stored structured `gold_rows` at reset-time where gold SQL is already executed, so no extra SQL execution path was introduced.
	  - Added direct dataclass tests for `EpisodeContext.gold_rows` to satisfy verification criteria for the new interface field.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Replace `_handle_answer()` naive normalized string equality with `verify_answer(predicted, gold, answer_type, gold_rows)` and keep terminal reward mapping unchanged.

	---

	### Step 2.2: Wire verify_answer into _handle_answer
	Slice: S2
	Goal: Replace naive string comparison in `_handle_answer()` with `verify_answer()` call.

	Files:
	- `server/sql_environment.py` - modify - Import and call `verify_answer()` in `_handle_answer()`

	Interface Changes:
	- Modified function: `_handle_answer()` now delegates to `verify_answer()`

	Implementation Details:
	1. Add import: `from server.verifier import verify_answer` at top of `sql_environment.py`.
	2. Replace the body of `_handle_answer()`:
	   - Remove: `submitted = value.strip().lower()` / `expected = ...` / `is_correct = submitted == expected`
	   - Add: `is_correct = verify_answer(predicted=value, gold=self._episode.gold_answer or "", answer_type=self._episode.question_record.answer_type, gold_rows=self._episode.gold_rows)`
	3. Keep: `self._episode.done = True` and `return is_correct, 1.0 if is_correct else 0.0`
	4. Run existing smoke tests to confirm no regressions.

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)
	- [x] Existing 25 smoke tests still pass

	Status: Completed

	<!-- Filled by /autocode-next-step after implementation -->
	Completed: 2026-03-27T22:33:12Z
	Changes Made:
	- `server/sql_environment.py` - imported `verify_answer` and replaced `_handle_answer()` naive normalized-string equality with `verify_answer(predicted, gold, answer_type, gold_rows)`.
	- `tests/test_verifier_integration.py` - added integration coverage for integer/float/string/list answer flows, fallback behavior for missing `answer_type`, and numeric coercion failure path.

	Result:
	- Outcome: Fully Successful
	- Evidence Captured:
	```
	uv run pytest tests/test_verifier.py -v
	============================== 34 passed in 6.64s ==============================

	uv run pytest tests/test_smoke.py -v
	============================== 25 passed in 6.53s ==============================

	uv run pytest tests/test_verifier_integration.py -v
	============================== 6 passed in 6.65s ==============================

	uv run pytest tests/ -v
	============================== 65 passed in 6.62s ==============================
	```
	- Tests run: `uv run pytest tests/test_verifier.py -v`; `uv run pytest tests/test_smoke.py -v`; `uv run pytest tests/test_verifier_integration.py -v`; `uv run pytest tests/ -v`
	- Notes:
	  - `_handle_answer()` now uses a single verifier dispatch path, keeping answer comparison logic centralized in `server/verifier.py`.
	  - Added integration tests because `VERIFICATION_SPEC.md` expected `tests/test_verifier_integration.py` evidence.
	  - Behavior delta was archived into `specs/behavior/sql-environment.md` and the delta file was removed.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Implementation complete. Proceed with commit/PR workflow (`/commit-push-pr`) for F002.

	---

	## 8. Rollout Considerations

	### Feature Flags
	- [x] Required: No
	- [ ] Flag name: N/A

	### Migration
	- [x] Data migration needed: No
	- [ ] Migration strategy: N/A

	### Rollback Plan
	Revert `_handle_answer()` to inline string comparison (3 lines). The `verify_answer()` module and `gold_rows` field are additive and harmless if unused.

	---

	## 9. Execution Tracking

	All execution state is tracked within this document:
	- Section 1a: Overall progress summary
	- Section 7: Per-step completion details, test results, and handoff context
	- FEATURES.json: Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
	- Git history: Full audit trail of changes to this file

	The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
	- Checking Section 1a for summary
	- Reviewing Section 7 for detailed step status
	- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
	- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history

	---

	## 9a. Slice Completion Protocol

	After all steps in a slice pass verification:

	1. Run verifier subagent for spec compliance
	   - Validates against VERIFICATION_SPEC.md criteria
	   - Ensures no TODOs or incomplete work in slice

	2. Run compound-engineer subagent to extract learnings
	   - Mandatory invocation after every slice completion
	   - Updates CLAUDE.md Learnings section (if durable patterns found)
	   - May exit with "no update needed" (valid for routine work)

	3. Commit the slice changes
	   - Follow commit message format in CLAUDE.md
	   - Each slice gets its own atomic commit

	4. Continue to next slice (if more slices remain)
	   - Or proceed to final verification if all slices complete

	Note: PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.

	---

	## 10. User Value Summary

	<!-- Populated by /autocode-next-step when final step completes -->

	Status: Generated

	### What Users Can Now Do
	Users can now submit answers across integer, float, string, and list questions and get correct pass/fail outcomes even when answers differ in formatting, case, numeric representation, or list ordering.

	### How to Access/Test
	Run `uv run pytest tests/test_verifier.py tests/test_verifier_integration.py -v`, or run `uv run pytest tests/ -v` for full regression coverage including end-to-end ANSWER handling through `SQLEnvironment.step()`.

	### Demo
	- Command: `uv run pytest tests/test_verifier_integration.py -v`

	### Release Notes Snippet
	Added type-aware answer verification so ANSWER correctness now supports numeric coercion, float tolerance, case-insensitive strings, and order-insensitive list matching.

	---

	## 11. PR Contract (Auto-Generated by autocode-next-step)

	<!-- This section is auto-populated by autocode-next-step command when all steps complete -->

	Status: Generated

	### Summary
	- Implemented type-aware answer verification in environment answer handling by routing `_handle_answer()` through `verify_answer()`.
	- Added integration coverage for typed answer paths and fallback behavior (`tests/test_verifier_integration.py`).
	- Archived F002 behavior delta into `specs/behavior/sql-environment.md` and captured durable learnings in `docs/learnings/F002-*.md`.

	### Validation
	- `uv run pytest tests/test_verifier.py -v` -> 34 passed
	- `uv run pytest tests/test_smoke.py -v` -> 25 passed
	- `uv run pytest tests/test_verifier_integration.py -v` -> 6 passed
	- `uv run pytest tests/ -v` -> 65 passed

	### Scope and Risk
	- Risk tier: Low
	- Security-sensitive changes: None
	- Scope creep: None (added integration tests to satisfy verification spec evidence requirements)

	### Ready Action
	All steps completed. Run `/commit-push-pr`.

	### PR Created
	https://github.com/hjerpe/sql-env/pull/7

	---

	## Stop Conditions (When to Split This Spec)

	Stop and create a new IMPLEMENTATION_SPEC if:
	- A step requires touching more than 3 files in unrelated areas
	- You need to introduce multiple new abstractions "just in case"
	- Verification cannot be made targeted and concrete
	- You discover new unknowns that change the plan materially
	- The next slice cannot be merged safely without finishing later slices

	When splitting, ensure the current slice ends in a merged, stable state.

	---

	## Human Checkpoint

	Before handing to AI agent:

	- [ ] Interface specifications are complete
	- [ ] Data flow is accurate
	- [ ] Error handling is specified
	- [ ] Implementation order makes sense
	- [ ] VERIFICATION_SPEC.md has been generated

	Questions:
	1. Should float tolerance be configurable per-question or fixed at 1%?
	2. Any additional answer_type values beyond the four specified?

	---

	## Handoff Notes

	For the implementing AI agent:

	```
	Context: See RESEARCH_SUMMARY.md for system understanding
	Spec: Follow this document exactly
	Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
	Ambiguity: Stop and ask rather than assume
	Order: Follow implementation order exactly
	Key decisions:
	- gold_rows passed raw to verifier (not just formatted string)
	- Fallback to string comparison when answer_type is None/unknown
	- No external dependencies -- pure Python only
	- match/case dispatch, not class hierarchy
	```

	---

	Specification completed: 2026-03-27
	Approved by: [NAME/ROLE]
	Verification spec: VERIFICATION_SPEC.md
	Verification input: [F002-VERIFICATION_INPUT.json](F002-VERIFICATION_INPUT.json)
	Target agent: Claude Code