# Implementation Specification

**Change:** F005 -- Green Agent Wrapper (automated evaluation)
**Date:** 2026-03-27
**Research Summary:** [specs/F005-RESEARCH_SUMMARY.md](F005-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Spec:** Archived to [specs/behavior/evaluation.md](behavior/evaluation.md)

**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed

---

## Core Intent (Immutable)

> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you're describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.

**User Problem:**
Run automated evaluation: "How does policy X perform over 100 episodes?" Single command, structured output. Enables training comparison (random vs trained).

**Success Criteria:**
- Single function call: `evaluate(n_episodes=100)` returns clean metrics dict
- Built-in random policy for instant baseline comparison
- Results include per-episode breakdown for analysis

**Avoid:**
- Evaluation crashes partway through and loses all results
- No progress indicator for long evaluation runs

**Out of Scope:**
- Visualization / plotting of results
- WebSocket / remote environment support (local SQLEnvironment only)
- Elaborate policy class hierarchy
- Training loop integration (F006 will consume this API)

---

## 0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in **small, mergeable increments**.

### Scope Budget
- Target: **2 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**

### Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.

**Each slice must have:**
- Clear outcome
- Minimal interface change
- Merge criteria

**Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
## Status Icons

**Step Status:**
- ⬜ Not Started
- 🔄 In Progress
- ✅ Completed
- 🔴 Blocked/Failed

**Result Outcome:**
- ✅ Fully Successful (all tests passed, no issues)
- ⚠️ Completed with Issues (needs follow-up)
- 🔴 Failed/Blocked

---

## 1. Implementation Overview

### Summary

Create an `evaluation/` subpackage containing the automated evaluation wrapper for SQLEnv. The package provides: (1) a `Policy` protocol defining the interface for any policy, (2) an `EpisodeResult` dataclass for per-episode metrics, (3) an `EvaluationResult` dataclass for aggregate metrics, (4) a `RandomPolicy` class as a built-in baseline, and (5) an `evaluate()` function that runs N episodes, collects results incrementally (surviving partial failures), and returns structured metrics. The module is purely additive -- no existing code is modified.

### Scope

**In Scope:**
- `evaluation/__init__.py` -- public API re-exports
- `evaluation/green_agent.py` -- Protocol, dataclasses, RandomPolicy, evaluate()
- `tests/test_evaluation.py` -- unit + integration tests

**Out of Scope:**
- Modifications to `server/sql_environment.py` or `models.py`
- CLI entry point (future feature)
- Remote / WebSocket evaluation
- Plotting or visualization

---

## 1a. Execution Status
<!-- Auto-updated by /autocode-next-step - do not edit manually -->

**Progress:** 4/4 steps complete
**Current Step:** All planned implementation steps are complete
**Last Updated:** 2026-03-28T00:04:03Z
**Latest Result:** Fully Successful (Step 2.2 complete)
**Blockers:** None

---

## 1b. Risk Assessment

**Risk Tier:** [x] Low | [ ] Medium | [ ] High

**High-Risk Indicators Present:** (check all that apply if tier is High)
- [ ] Touches authentication or authorization logic
- [ ] Handles payment processing or financial data
- [ ] Manages secrets, API keys, or credentials
- [ ] Processes untrusted user input (file uploads, external APIs)
- [ ] Modifies privilege/permission systems

**Security Review Required:** [ ] Yes (if High) | [x] No

**Justification:**
Pure additive feature. Client-side evaluation loop that reads from the existing environment API. No security, auth, or data mutation concerns.

---

## 2. Change Manifest

### Files to Create

| File | Purpose |
|------|---------|
| `evaluation/__init__.py` | Public API: re-exports Policy, RandomPolicy, EpisodeResult, EvaluationResult, evaluate |
| `evaluation/green_agent.py` | Core evaluation logic: Protocol, dataclasses, RandomPolicy, evaluate() |
| `tests/test_evaluation.py` | Unit tests for types + RandomPolicy, integration test with SQLEnvironment |

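The `__init__.py` row above implies a small re-export module. A plausible sketch follows; the shipped file is not quoted in this spec, so the ordering and the `__all__` list are assumptions drawn from the manifest:

```python
# Sketch of evaluation/__init__.py inferred from the manifest above;
# the actual shipped file may differ in ordering or add a docstring.
from .green_agent import (
    EpisodeResult,
    EvaluationResult,
    Policy,
    RandomPolicy,
    evaluate,
)

__all__ = ["Policy", "RandomPolicy", "EpisodeResult", "EvaluationResult", "evaluate"]
```
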
### Files to Modify

None.

### Files to Delete

None.

---

## 3. Interface Specifications

### New Types

```python
# Location: evaluation/green_agent.py

from dataclasses import dataclass
from typing import Protocol, runtime_checkable

# SQLObservation and SQLAction are the project's existing observation/action
# models (see models.py); green_agent.py imports them rather than defining them.


@runtime_checkable
class Policy(Protocol):
    """Interface for any evaluation policy.

    Any object with a select_action method matching this signature
    is a valid policy (structural subtyping / duck typing).
    """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Choose an action given an observation."""
        ...


@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode evaluation metrics."""

    episode_index: int        # 0-based episode number
    correct: bool             # Whether ANSWER action matched gold
    total_reward: float       # Cumulative reward for the episode
    steps: int                # Number of steps taken
    error: str | None = None  # Error message if episode failed


@dataclass(frozen=True)
class EvaluationResult:
    """Aggregate evaluation metrics with per-episode breakdown."""

    success_rate: float            # Fraction of correct episodes [0.0, 1.0]
    avg_reward: float              # Mean total_reward across episodes
    avg_steps: float               # Mean steps across episodes
    n_episodes: int                # Number of episodes attempted
    n_completed: int               # Episodes that ran to completion (no error)
    episodes: list[EpisodeResult]  # Per-episode breakdown
```

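Because `Policy` is a `runtime_checkable` protocol, conformance is structural rather than nominal. The following self-contained sketch illustrates the check; the stub observation/action types and `FixedAnswerPolicy` are illustrative stand-ins, not part of the spec:

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class SQLObservation:  # stub standing in for the project's real model
    done: bool = False


@dataclass
class SQLAction:  # stub standing in for the project's real model
    action_type: str = "ANSWER"
    payload: str = "unknown"


@runtime_checkable
class Policy(Protocol):
    def select_action(self, observation: SQLObservation) -> SQLAction: ...


class FixedAnswerPolicy:
    """Never inherits from Policy -- structural typing makes it conform."""

    def select_action(self, observation: SQLObservation) -> SQLAction:
        return SQLAction(action_type="ANSWER", payload="unknown")


# runtime_checkable isinstance() only verifies the method exists;
# it does not validate the signature or return type.
assert isinstance(FixedAnswerPolicy(), Policy)
```
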
### New Functions

```python
# Location: evaluation/green_agent.py

from collections.abc import Callable

# SQLEnvironment comes from server/sql_environment.py (see Section 10).


class RandomPolicy:
    """Built-in random baseline policy.

    Selects random action types and arguments. Deterministic given a seed.
    """

    def __init__(self, seed: int | None = None) -> None:
        """
        Args:
            seed: Random seed for reproducibility. None = non-deterministic.
        """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Pick a random action based on current observation.

        Strategy:
        - If budget_remaining > 1: randomly choose DESCRIBE, SAMPLE, or QUERY
        - If budget_remaining == 1: always ANSWER with a random guess
        - DESCRIBE/SAMPLE: pick a random table from schema_info
        - QUERY: generate a simple SELECT * FROM <table> LIMIT 5
        - ANSWER: pick a random value from last result or "unknown"

        Args:
            observation: Current environment observation

        Returns:
            A random SQLAction
        """


def evaluate(
    env: SQLEnvironment,
    policy: Policy,
    n_episodes: int = 100,
    *,
    seed: int | None = None,
    progress_callback: Callable[[int, int], None] | None = None,
) -> EvaluationResult:
    """Run automated evaluation of a policy over multiple episodes.

    Collects results incrementally -- if an episode fails, it is recorded
    as an error and evaluation continues with the next episode.

    Args:
        env: The SQLEnvironment instance to evaluate against.
        policy: Any object satisfying the Policy protocol.
        n_episodes: Number of episodes to run (0 returns empty result).
        seed: Base seed for reproducibility. Episode i uses seed+i.
        progress_callback: Optional callback(current, total) for progress.

    Returns:
        EvaluationResult with aggregate metrics and per-episode breakdown.

    Raises:
        ValueError: If n_episodes < 0.
    """
```

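A minimal usage sketch tying the pieces together. The environment construction mirrors Section 10, so the `"..."` paths and the `tokenizer` reference are placeholders there as well; the callback cadence and print format are illustrative, not specified:

```python
# Sketch: seeded evaluation with a simple progress callback.
from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment


def print_progress(current: int, total: int) -> None:
    # Called once per attempted episode, including episodes that fail.
    if current % 10 == 0 or current == total:
        print(f"evaluated {current}/{total} episodes")


env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=100, seed=42,
                  progress_callback=print_progress)
print(f"success rate {result.success_rate:.1%} "
      f"({result.n_completed}/{result.n_episodes} episodes completed)")
```
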
| --- |
|
|
| ## 4. Data Flow |
|
|
| ### Primary Flow |
|
|
| ``` |
| 1. evaluate(env, policy, n_episodes=100, seed=42) |
| - Input: environment, policy, episode count, optional seed |
| |
| 2. For each episode i in range(n_episodes): |
| a. obs = env.reset(seed=seed+i if seed else None) |
| b. While not obs.done: |
| - action = policy.select_action(obs) |
| - obs = env.step(action) |
| - Accumulate reward |
| c. Record EpisodeResult(correct=..., total_reward=..., steps=...) |
| d. Call progress_callback(i+1, n_episodes) if provided |
| |
| 3. Aggregate results: |
| - success_rate = sum(correct) / n_completed |
| - avg_reward = mean(total_reward) across completed |
| - avg_steps = mean(steps) across completed |
| |
| 4. Return EvaluationResult |
| ``` |
|
|
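One subtlety in step 2a: `seed=0` is falsy in Python, so the derivation must compare against `None` rather than relying on truthiness. A minimal sketch (the helper name is illustrative):

```python
# Sketch: per-episode seed derivation. Testing `is not None` (rather than
# truthiness) keeps seed=0 deterministic instead of silently disabling it.
def episode_seed(base: int | None, i: int) -> int | None:
    return base + i if base is not None else None


assert episode_seed(0, 3) == 3        # seed=0 is still honored
assert episode_seed(None, 3) is None  # unseeded stays non-deterministic
```
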
### Alternative Flows

**When n_episodes=0:**
```
1. Return EvaluationResult(success_rate=0.0, avg_reward=0.0,
   avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[])
```

**When episode raises exception:**
```
1. Catch exception in the episode loop
2. Record EpisodeResult(correct=False, total_reward=0.0, steps=0,
   error=str(exception))
3. Continue to next episode
```

**When env.reset() fails:**
```
1. Catch exception
2. Record EpisodeResult with error, steps=0
3. Continue to next episode
```

---

## 5. Error Handling

### Error Types

| Error | When | Handling |
|-------|------|----------|
| `ValueError` | `n_episodes < 0` | Raise immediately |
| `Exception` during `env.reset()` | DB not found, bad questions file | Catch, record as failed episode, continue |
| `Exception` during `policy.select_action()` | Policy bug | Catch, record as failed episode, continue |
| `Exception` during `env.step()` | Environment bug | Catch, record as failed episode, continue |

### Error Handling Strategy

```python
# Pattern: incremental collection with per-episode error isolation
episodes: list[EpisodeResult] = []
for i in range(n_episodes):
    episode_seed = seed + i if seed is not None else None
    try:
        obs = env.reset(seed=episode_seed)
        total_reward = 0.0
        steps = 0
        while not obs.done:
            action = policy.select_action(obs)
            obs = env.step(action)
            total_reward += obs.reward or 0.0
            steps += 1
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=_check_correct(obs),
            total_reward=total_reward,
            steps=steps,
        ))
    except Exception as exc:
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=False,
            total_reward=0.0,
            steps=0,
            error=str(exc),
        ))
```

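The aggregation that follows the loop is not quoted in this spec; a sketch consistent with the Section 4 data flow (completed episodes only, zero-safe division) might look like:

```python
# Continuation of the loop above (a sketch, not the shipped implementation):
# aggregate over completed episodes only and guard the empty case so the
# division never raises ZeroDivisionError.
completed = [e for e in episodes if e.error is None]
n_completed = len(completed)
n = n_completed or 1  # sums are 0 when empty, so aggregates stay 0.0
return EvaluationResult(
    success_rate=sum(e.correct for e in completed) / n,
    avg_reward=sum(e.total_reward for e in completed) / n,
    avg_steps=sum(e.steps for e in completed) / n,
    n_episodes=n_episodes,
    n_completed=n_completed,
    episodes=episodes,
)
```
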
### Retry Strategy

| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Episode evaluation | No | Record error, move to next episode |
| Environment reset | No | Record error, move to next episode |

---

## 6. Slice Plan (What we will ship, in order)

### Slice S1 -- Types, Protocol, and RandomPolicy
**Value:** Establishes the evaluation interface and provides a usable random baseline
**User-visible change:** Yes -- users can instantiate RandomPolicy and call select_action
**Interfaces introduced/changed:** Policy protocol, EpisodeResult, EvaluationResult, RandomPolicy
**Rollback safety:** Purely additive -- new files only, no changes to existing code

### Slice S2 -- evaluate() Function and Integration Test
**Value:** Users can run `evaluate(env, random_policy, n_episodes=100)` and get structured metrics
**User-visible change:** Yes -- the core capability is now available
**Interfaces introduced/changed:** evaluate() function
**Rollback safety:** Purely additive -- extends S1 files, no changes to existing code

---

## 7. Implementation Steps

> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.

### Step 1.1: Types and Protocol
**Slice:** S1
**Goal:** Define the Policy protocol, EpisodeResult dataclass, and EvaluationResult dataclass.

**Files:**
- `evaluation/__init__.py` - create - empty init with re-exports
- `evaluation/green_agent.py` - create - Protocol + dataclasses (no functions yet)

**Interface Changes:**
- New: `Policy` protocol with `select_action(observation: SQLObservation) -> SQLAction`
- New: `EpisodeResult` frozen dataclass
- New: `EvaluationResult` frozen dataclass

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-27T23:51:09Z
**Changes Made:**
- Created `evaluation/__init__.py` with public re-exports for `Policy`, `EpisodeResult`, and `EvaluationResult`.
- Created `evaluation/green_agent.py` with the `Policy` runtime-checkable protocol and frozen `EpisodeResult`/`EvaluationResult` dataclasses.

**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/ -v
  Result: 100 passed, 1 skipped
  Scope: full project regression run after adding new evaluation types
  ```
- **Tests run:** `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - Dataclass and protocol scaffolding is additive and isolated to a new package.
  - `pytest` is not installed in the project environment yet, so verification used `uv run --with pytest` for this step.
  - Import fallback mirrors existing package-vs-standalone test collection behavior in the repo.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- Types are defined and importable from `evaluation`

---

### Step 1.2: RandomPolicy Implementation
**Slice:** S1
**Goal:** Implement the RandomPolicy class that selects random actions based on observation state.

**Files:**
- `evaluation/green_agent.py` - modify - add RandomPolicy class

**Interface Changes:**
- New: `RandomPolicy.__init__(seed: int | None = None)`
- New: `RandomPolicy.select_action(observation: SQLObservation) -> SQLAction`

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-27T23:55:10Z
**Changes Made:**
- Implemented `RandomPolicy` in `evaluation/green_agent.py` with seed-controlled randomness, budget-aware action selection, schema table parsing, and row-based answer candidate extraction.
- Updated `evaluation/__init__.py` to re-export `RandomPolicy` from the public evaluation API.
- Added `tests/test_evaluation.py` with focused RandomPolicy behavior tests (exploration vs answer mode, determinism, action type coverage, and answer extraction).

**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 6 passed
  Scope: RandomPolicy unit coverage for F005 Step 1.2

  Command: uv run --with pytest pytest tests/ -v
  Result: 106 passed, 1 skipped
  Scope: Full regression after RandomPolicy implementation
  ```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - RandomPolicy always explores with DESCRIBE/SAMPLE/QUERY while budget remains and forces ANSWER on the last step.
  - Schema parsing intentionally handles both `- table` and `- table: columns...` observation formats.
  - Verification commands in the spec referenced `tests/unit/...`; this repo uses a flat `tests/` layout, so tests were added in `tests/test_evaluation.py`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- RandomPolicy is implemented and exported from the public `evaluation` API
- Ready to implement `evaluate()` using per-episode loop and error isolation

---

### Step 2.1: evaluate() Function
**Slice:** S2
**Goal:** Implement the core evaluate() function with incremental collection and error isolation.

**Files:**
- `evaluation/green_agent.py` - modify - add evaluate() function
- `evaluation/__init__.py` - modify - add evaluate to re-exports

**Interface Changes:**
- New: `evaluate(env, policy, n_episodes, *, seed, progress_callback) -> EvaluationResult`

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-27T23:59:28Z
**Changes Made:**
- Added `evaluate()` to `evaluation/green_agent.py` with per-episode reset/step loop, seed+i reset behavior, progress callback support, and per-episode error isolation.
- Added `evaluate` to `evaluation/__init__.py` public exports.
- Extended `tests/test_evaluation.py` with unit coverage for the evaluate() happy path, zero/negative episodes, seed propagation, exception handling, aggregate calculations, and progress callback behavior.

**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 14 passed
  Scope: RandomPolicy + evaluate() unit coverage for F005 Step 2.1

  Command: uv run --with pytest pytest tests/ -v
  Result: 114 passed, 1 skipped
  Scope: Full regression after evaluate() implementation
  ```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - evaluate() computes aggregates using completed episodes only (`error is None`), matching the error-isolation behavior in the spec data flow.
  - The progress callback is invoked once per attempted episode, including episodes that fail.
  - The repository environment still does not include pytest by default, so verification used `uv run --with pytest`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- evaluate() is implemented, exported, and covered by focused unit tests
- Next step should add/expand integration coverage with a real `SQLEnvironment` evaluation run

---

### Step 2.2: Integration Test with SQLEnvironment
**Slice:** S2
**Goal:** Write an integration test that runs evaluate() with RandomPolicy against a real SQLEnvironment.

**Files:**
- `tests/test_evaluation.py` - create - unit tests for types + RandomPolicy + evaluate(); integration test with real env

**Interface Changes:**
None (test-only step).

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** ✅ Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-28T00:04:03Z
**Changes Made:**
- Added `_build_sql_environment()` test helper in `tests/test_evaluation.py` to spin up a real SQLite-backed `SQLEnvironment` with a deterministic question fixture.
- Added `test_evaluate_integration_with_sql_environment` validating end-to-end `evaluate()` execution over 10 episodes with aggregate-metric consistency checks.
- Added `test_evaluate_integration_is_deterministic_with_seeds` validating deterministic full-result equality when both policy and environment seeds are fixed (sketched below).

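A condensed sketch of the determinism test named above. `_build_sql_environment()` is the helper introduced in this step, but its exact signature and fixture details are assumptions here, not quoted from the repository:

```python
# Sketch only: mirrors the determinism test described in Changes Made.
from evaluation import RandomPolicy, evaluate


def test_evaluate_integration_is_deterministic_with_seeds():
    env = _build_sql_environment()  # real SQLite-backed SQLEnvironment fixture
    first = evaluate(env, RandomPolicy(seed=42), n_episodes=10, seed=42)
    second = evaluate(env, RandomPolicy(seed=42), n_episodes=10, seed=42)
    # Frozen dataclasses compare by value, so full-result equality covers
    # both the aggregates and the per-episode breakdown in one assertion.
    assert first == second
```
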
**Result:**
- **Outcome:** ✅ Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 16 passed
  Scope: evaluation unit + integration coverage including real SQLEnvironment flow

  Command: uv run --with pytest pytest tests/ -v
  Result: 116 passed, 1 skipped
  Scope: full project regression after adding integration coverage
  ```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - Integration tests were implemented in `tests/test_evaluation.py` to match this repository's flat test layout.
  - Verifier gate approved finalization in MVP mode after test evidence review.
  - Reviewer auto-step was skipped by policy because risk tier is Low, tests passed, and no security-sensitive surfaces changed.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- All implementation steps are complete and the verification gate passed.

---

## 8. Rollout Considerations

### Feature Flags
- [x] Required: No
- [ ] Flag name: N/A

### Migration
- [x] Data migration needed: No
- [ ] Migration strategy: N/A

### Rollback Plan
Delete the `evaluation/` directory. No other code references it.

---

## 9. Execution Tracking

All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for the summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history

---

## 9a. Slice Completion Protocol

After all steps in a slice pass verification:

1. **Run verifier subagent** for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in the slice

2. **Run compound-engineer subagent** to extract learnings
   - **Mandatory invocation** after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)

3. **Commit** the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit

4. **Continue to next slice** (if more slices remain)
   - Or proceed to final verification if all slices complete

**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.

---

## 10. User Value Summary

<!-- Populated by /autocode-next-step when final step completes -->

**Status:** Generated

### What Users Can Now Do
Run automated evaluation of any policy over N episodes with `evaluate(env, policy, n_episodes=100)` and get structured metrics including success rate, average reward, average steps, and per-episode breakdown.

### How to Access/Test
```python
from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment

env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=10, seed=42)
print(f"Success rate: {result.success_rate:.1%}")
print(f"Avg reward: {result.avg_reward:.3f}")
```

### Demo
- **Command:** `uv run python -c "from evaluation import evaluate, RandomPolicy; ..."`

### Release Notes Snippet
Added automated evaluation wrapper with built-in random baseline policy for benchmarking agent performance.

---

## 11. PR Contract (Auto-Generated by autocode-next-step)

<!-- This section is auto-populated by autocode-next-step command when all steps complete -->

**Status:** Generated

### PR Title
feat(evaluation): complete green agent wrapper integration and finalization

### PR Summary
- Add deterministic integration coverage for `evaluate()` against a real `SQLEnvironment` fixture.
- Finalize F005 with full regression evidence, verifier approval, and archived behavior documentation.
- Capture durable learnings under `docs/learnings/` for evaluation patterns and deterministic testing.

### Verification
- `uv run --with pytest pytest tests/test_evaluation.py -v`
- `uv run --with pytest pytest tests/ -v`

### Follow-up
All steps completed. PR Created: https://github.com/hjerpe/sql-env/pull/10

---

## Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.

---

## Human Checkpoint

**Before handing to AI agent:**

- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated

**Questions:**
1. Any remaining concerns?
2. Anything the agent should know?

---

## Handoff Notes

**For the implementing AI agent:**

```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
```

---

*Specification completed: 2026-03-27*
*Approved by: --*
*Verification spec: VERIFICATION_SPEC.md*
*Verification input: [F005-VERIFICATION_INPUT.json](F005-VERIFICATION_INPUT.json)*
*Target agent: Claude Code*