# Implementation Specification

**Change:** F005 -- Green Agent Wrapper (automated evaluation)
**Date:** 2026-03-27
**Research Summary:** [specs/F005-RESEARCH_SUMMARY.md](F005-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Spec:** Archived to [specs/behavior/evaluation.md](behavior/evaluation.md)

**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed

---

## Core Intent (Immutable)

> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you're describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.

**User Problem:**
Run automated evaluation: "How does policy X perform over 100 episodes?" Single command, structured output. Enables training comparison (random vs trained).

**Success Criteria:**
- Single function call: `evaluate(n_episodes=100)` returns clean metrics dict
- Built-in random policy for instant baseline comparison
- Results include per-episode breakdown for analysis

**Avoid:**
- Evaluation crashes partway through and loses all results
- No progress indicator for long evaluation runs

**Out of Scope:**
- Visualization / plotting of results
- WebSocket / remote environment support (local SQLEnvironment only)
- Elaborate policy class hierarchy
- Training loop integration (F006 will consume this API)

---

## 0. Slicing & Scope Budget (Anti-Waterfall)

This spec must be executable in **small, mergeable increments**.

### Scope Budget
- Target: **2 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**

### Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.

**Each slice must have:**
- Clear outcome
- Minimal interface change
- Merge criteria

**Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).

## Status Labels

**Step Status:**
- Not Started
- In Progress
- Completed
- Blocked/Failed

**Result Outcome:**
- Fully Successful (all tests passed, no issues)
- Completed with Issues (needs follow-up)
- Failed/Blocked

---

## 1. Implementation Overview

### Summary

Create an `evaluation/` subpackage containing the automated evaluation wrapper for SQLEnv. The package provides: (1) a `Policy` protocol defining the interface for any policy, (2) an `EpisodeResult` dataclass for per-episode metrics, (3) an `EvaluationResult` dataclass for aggregate metrics, (4) a `RandomPolicy` class as a built-in baseline, and (5) an `evaluate()` function that runs N episodes, collects results incrementally (surviving partial failures), and returns structured metrics. The module is purely additive -- no existing code is modified.

### Scope

**In Scope:**
- `evaluation/__init__.py` -- public API re-exports
- `evaluation/green_agent.py` -- Protocol, dataclasses, RandomPolicy, evaluate()
- `tests/test_evaluation.py` -- unit + integration tests

**Out of Scope:**
- Modifications to `server/sql_environment.py` or `models.py`
- CLI entry point (future feature)
- Remote / WebSocket evaluation
- Plotting or visualization

---

## 1a. Execution Status
<!-- Auto-updated by /autocode-next-step - do not edit manually -->

**Progress:** 4/4 steps complete
**Current Step:** All planned implementation steps are complete
**Last Updated:** 2026-03-28T00:04:03Z
**Latest Result:** Fully Successful (Step 2.2 complete)
**Blockers:** None

---

## 1b. Risk Assessment

**Risk Tier:** [x] Low | [ ] Medium | [ ] High

**High-Risk Indicators Present:** (check all that apply if tier is High)
- [ ] Touches authentication or authorization logic
- [ ] Handles payment processing or financial data
- [ ] Manages secrets, API keys, or credentials
- [ ] Processes untrusted user input (file uploads, external APIs)
- [ ] Modifies privilege/permission systems

**Security Review Required:** [ ] Yes (if High) | [x] No

**Justification:**
Pure additive feature. Client-side evaluation loop that reads from the existing environment API. No security, auth, or data mutation concerns.

---

## 2. Change Manifest

### Files to Create

| File | Purpose |
|------|---------|
| `evaluation/__init__.py` | Public API: re-exports Policy, RandomPolicy, EpisodeResult, EvaluationResult, evaluate |
| `evaluation/green_agent.py` | Core evaluation logic: Protocol, dataclasses, RandomPolicy, evaluate() |
| `tests/test_evaluation.py` | Unit tests for types + RandomPolicy, integration test with SQLEnvironment |

### Files to Modify

None.

### Files to Delete

None.

---

## 3. Interface Specifications

### New Types

```python
# Location: evaluation/green_agent.py

from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable

@runtime_checkable
class Policy(Protocol):
    """Interface for any evaluation policy.

    Any object with a select_action method matching this signature
    is a valid policy (structural subtyping / duck typing).
    """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Choose an action given an observation."""
        ...


@dataclass(frozen=True)
class EpisodeResult:
    """Per-episode evaluation metrics."""

    episode_index: int          # 0-based episode number
    correct: bool               # Whether ANSWER action matched gold
    total_reward: float         # Cumulative reward for the episode
    steps: int                  # Number of steps taken
    error: str | None = None    # Error message if episode failed


@dataclass(frozen=True)
class EvaluationResult:
    """Aggregate evaluation metrics with per-episode breakdown."""

    success_rate: float                 # Fraction of correct episodes [0.0, 1.0]
    avg_reward: float                   # Mean total_reward across episodes
    avg_steps: float                    # Mean steps across episodes
    n_episodes: int                     # Number of episodes attempted
    n_completed: int                    # Episodes that ran to completion (no error)
    episodes: list[EpisodeResult]       # Per-episode breakdown
```
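The structural-subtyping claim can be checked in isolation. The sketch below is illustrative only: `SQLObservation` and `SQLAction` are reduced to hypothetical stand-ins (their real fields live in the project and are not reproduced here), and `AlwaysAnswerPolicy` and `NotAPolicy` are invented names. It shows that a class satisfies `Policy` without inheriting from it.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


# Hypothetical stand-ins for the project's real SQLObservation / SQLAction.
@dataclass
class SQLObservation:
    budget_remaining: int = 5
    done: bool = False


@dataclass
class SQLAction:
    action_type: str
    argument: str


@runtime_checkable
class Policy(Protocol):
    def select_action(self, observation: SQLObservation) -> SQLAction: ...


class AlwaysAnswerPolicy:
    """Does not inherit from Policy; having a select_action method is enough."""

    def select_action(self, observation: SQLObservation) -> SQLAction:
        return SQLAction(action_type="ANSWER", argument="unknown")


class NotAPolicy:
    """Has no select_action method, so the isinstance check fails."""


is_policy = isinstance(AlwaysAnswerPolicy(), Policy)
is_not_policy = isinstance(NotAPolicy(), Policy)
print(is_policy, is_not_policy)
```

Note that `runtime_checkable` only verifies that a `select_action` attribute exists; it does not check the signature or return type, so a static type checker is still needed to catch mismatched signatures.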

### New Functions

```python
# Location: evaluation/green_agent.py

from collections.abc import Callable

class RandomPolicy:
    """Built-in random baseline policy.

    Selects random action types and arguments. Deterministic given a seed.
    """

    def __init__(self, seed: int | None = None) -> None:
        """
        Args:
            seed: Random seed for reproducibility. None = non-deterministic.
        """

    def select_action(self, observation: SQLObservation) -> SQLAction:
        """Pick a random action based on current observation.

        Strategy:
        - If budget_remaining > 1: randomly choose DESCRIBE, SAMPLE, or QUERY
        - If budget_remaining == 1: always ANSWER with a random guess
        - DESCRIBE/SAMPLE: pick a random table from schema_info
        - QUERY: generate a simple SELECT * FROM <table> LIMIT 5
        - ANSWER: pick a random value from last result or "unknown"

        Args:
            observation: Current environment observation

        Returns:
            A random SQLAction
        """


def evaluate(
    env: SQLEnvironment,
    policy: Policy,
    n_episodes: int = 100,
    *,
    seed: int | None = None,
    progress_callback: Callable[[int, int], None] | None = None,
) -> EvaluationResult:
    """Run automated evaluation of a policy over multiple episodes.

    Collects results incrementally -- if an episode fails, it is recorded
    as an error and evaluation continues with the next episode.

    Args:
        env: The SQLEnvironment instance to evaluate against.
        policy: Any object satisfying the Policy protocol.
        n_episodes: Number of episodes to run (0 returns empty result).
        seed: Base seed for reproducibility. Episode i uses seed+i.
        progress_callback: Optional callback(current, total) for progress.

    Returns:
        EvaluationResult with aggregate metrics and per-episode breakdown.

    Raises:
        ValueError: If n_episodes < 0.
    """
```

---

## 4. Data Flow

### Primary Flow

```
1. evaluate(env, policy, n_episodes=100, seed=42)
   - Input: environment, policy, episode count, optional seed

2. For each episode i in range(n_episodes):
   a. obs = env.reset(seed=seed+i if seed is not None else None)
   b. While not obs.done:
      - action = policy.select_action(obs)
      - obs = env.step(action)
      - Accumulate reward
   c. Record EpisodeResult(correct=..., total_reward=..., steps=...)
   d. Call progress_callback(i+1, n_episodes) if provided

3. Aggregate results:
   - success_rate = sum(correct) / n_completed (0.0 when n_completed == 0)
   - avg_reward = mean(total_reward) across completed
   - avg_steps = mean(steps) across completed

4. Return EvaluationResult
```

### Alternative Flows

**When n_episodes=0:**
```
1. Return EvaluationResult(success_rate=0.0, avg_reward=0.0,
     avg_steps=0.0, n_episodes=0, n_completed=0, episodes=[])
```

**When episode raises exception:**
```
1. Catch exception in the episode loop
2. Record EpisodeResult(correct=False, total_reward=0.0, steps=0,
     error=str(exception))
3. Continue to next episode
```

**When env.reset() fails:**
```
1. Catch exception
2. Record EpisodeResult with error, steps=0
3. Continue to next episode
```

---

## 5. Error Handling

### Error Types

| Error | When | Handling |
|-------|------|----------|
| `ValueError` | `n_episodes < 0` | Raise immediately |
| `Exception` during `env.reset()` | DB not found, bad questions file | Catch, record as failed episode, continue |
| `Exception` during `policy.select_action()` | Policy bug | Catch, record as failed episode, continue |
| `Exception` during `env.step()` | Environment bug | Catch, record as failed episode, continue |

### Error Handling Strategy

```python
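# Pattern: incremental collection with per-episode error isolation
episodes: list[EpisodeResult] = []
for i in range(n_episodes):
    # Explicit None check so that seed=0 still produces per-episode seeds
    episode_seed = seed + i if seed is not None else None
    try:
        obs = env.reset(seed=episode_seed)
        total_reward = 0.0
        steps = 0
        while not obs.done:
            action = policy.select_action(obs)
            obs = env.step(action)
            total_reward += obs.reward or 0.0
            steps += 1
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=_check_correct(obs),
            total_reward=total_reward,
            steps=steps,
        ))
    except Exception as exc:
        # Failures in reset, select_action, or step are recorded, not re-raised
        episodes.append(EpisodeResult(
            episode_index=i,
            correct=False,
            total_reward=0.0,
            steps=0,
            error=str(exc),
        ))
    # Report progress once per attempted episode, including failed ones
    if progress_callback is not None:
        progress_callback(i + 1, n_episodes)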
```

### Retry Strategy

| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Episode evaluation | No | Record error, move to next episode |
| Environment reset | No | Record error, move to next episode |

---

## 6. Slice Plan (What we will ship, in order)

### Slice S1 -- Types, Protocol, and RandomPolicy
**Value:** Establishes the evaluation interface and provides a usable random baseline
**User-visible change:** Yes -- users can instantiate RandomPolicy and call select_action
**Interfaces introduced/changed:** Policy protocol, EpisodeResult, EvaluationResult, RandomPolicy
**Rollback safety:** Purely additive -- new files only, no changes to existing code

### Slice S2 -- evaluate() Function and Integration Test
**Value:** Users can run `evaluate(env, random_policy, n_episodes=100)` and get structured metrics
**User-visible change:** Yes -- the core capability is now available
**Interfaces introduced/changed:** evaluate() function
**Rollback safety:** Purely additive -- extends S1 files, no changes to existing code

---

## 7. Implementation Steps

> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.

### Step 1.1: Types and Protocol
**Slice:** S1
**Goal:** Define the Policy protocol, EpisodeResult dataclass, and EvaluationResult dataclass.

**Files:**
- `evaluation/__init__.py` - create - package init containing only public re-exports
- `evaluation/green_agent.py` - create - Protocol + dataclasses (no functions yet)

**Interface Changes:**
- New: `Policy` protocol with `select_action(observation: SQLObservation) -> SQLAction`
- New: `EpisodeResult` frozen dataclass
- New: `EvaluationResult` frozen dataclass

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-27T23:51:09Z
**Changes Made:**
- Created `evaluation/__init__.py` with public re-exports for `Policy`, `EpisodeResult`, and `EvaluationResult`.
- Created `evaluation/green_agent.py` with the `Policy` runtime-checkable protocol and frozen `EpisodeResult`/`EvaluationResult` dataclasses.

**Result:**
- **Outcome:** Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/ -v
  Result: 100 passed, 1 skipped
  Scope: full project regression run after adding new evaluation types
  ```
- **Tests run:** `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - Dataclass and protocol scaffolding is additive and isolated to a new package.
  - `pytest` is not installed in the project environment yet, so verification used `uv run --with pytest` for this step.
  - Import fallback mirrors existing package-vs-standalone test collection behavior in the repo.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- Types are defined and importable from `evaluation`

---

### Step 1.2: RandomPolicy Implementation
**Slice:** S1
**Goal:** Implement the RandomPolicy class that selects random actions based on observation state.

**Files:**
- `evaluation/green_agent.py` - modify - add RandomPolicy class

**Interface Changes:**
- New: `RandomPolicy.__init__(seed: int | None = None)`
- New: `RandomPolicy.select_action(observation: SQLObservation) -> SQLAction`

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-27T23:55:10Z
**Changes Made:**
- Implemented `RandomPolicy` in `evaluation/green_agent.py` with seed-controlled randomness, budget-aware action selection, schema table parsing, and row-based answer candidate extraction.
- Updated `evaluation/__init__.py` to re-export `RandomPolicy` from the public evaluation API.
- Added `tests/test_evaluation.py` with focused RandomPolicy behavior tests (exploration vs answer mode, determinism, action type coverage, and answer extraction).

**Result:**
- **Outcome:** Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 6 passed
  Scope: RandomPolicy unit coverage for F005 Step 1.2

  Command: uv run --with pytest pytest tests/ -v
  Result: 106 passed, 1 skipped
  Scope: Full regression after RandomPolicy implementation
  ```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - RandomPolicy always explores with DESCRIBE/SAMPLE/QUERY while budget remains and forces ANSWER on the last step.
  - Schema parsing intentionally handles both `- table` and `- table: columns...` observation formats.
  - Verification commands in the spec referenced `tests/unit/...`; this repo uses a flat `tests/` layout, so tests were added in `tests/test_evaluation.py`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- RandomPolicy is implemented and exported from the public `evaluation` API
- Ready to implement `evaluate()` using per-episode loop and error isolation

---

### Step 2.1: evaluate() Function
**Slice:** S2
**Goal:** Implement the core evaluate() function with incremental collection and error isolation.

**Files:**
- `evaluation/green_agent.py` - modify - add evaluate() function
- `evaluation/__init__.py` - modify - add evaluate to re-exports

**Interface Changes:**
- New: `evaluate(env, policy, n_episodes, *, seed, progress_callback) -> EvaluationResult`

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-27T23:59:28Z
**Changes Made:**
- Added `evaluate()` to `evaluation/green_agent.py` with per-episode reset/step loop, seed+i reset behavior, progress callback support, and per-episode error isolation.
- Added `evaluate` to `evaluation/__init__.py` public exports.
- Extended `tests/test_evaluation.py` with unit coverage for evaluate happy path, zero/negative episodes, seed propagation, exception handling, aggregate calculations, and progress callback behavior.

**Result:**
- **Outcome:** Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 14 passed
  Scope: RandomPolicy + evaluate() unit coverage for F005 Step 2.1

  Command: uv run --with pytest pytest tests/ -v
  Result: 114 passed, 1 skipped
  Scope: Full regression after evaluate() implementation
  ```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - evaluate() computes aggregates using completed episodes only (`error is None`), matching the error-isolation behavior in the spec data flow.
  - Progress callback is invoked once per attempted episode, including episodes that fail.
  - Repository environment still does not include pytest by default, so verification used `uv run --with pytest`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- evaluate() is implemented, exported, and covered by focused unit tests
- Next step should add/expand integration coverage with a real `SQLEnvironment` evaluation run

---

### Step 2.2: Integration Test with SQLEnvironment
**Slice:** S2
**Goal:** Write integration test that runs evaluate() with RandomPolicy against a real SQLEnvironment.

**Files:**
- `tests/test_evaluation.py` - modify - add integration tests that run `evaluate()` against a real `SQLEnvironment` (the file was created in Step 1.2)

**Interface Changes:**
None (test-only step).

**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

**Risk Tier for This Step:** [x] Low | [ ] Medium | [ ] High

**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)

**Status:** Completed

<!-- Filled by /autocode-next-step after implementation -->
**Completed:** 2026-03-28T00:04:03Z
**Changes Made:**
- Added `_build_sql_environment()` test helper in `tests/test_evaluation.py` to spin up a real SQLite-backed `SQLEnvironment` with a deterministic question fixture.
- Added `test_evaluate_integration_with_sql_environment` validating end-to-end `evaluate()` execution over 10 episodes with aggregate-metric consistency checks.
- Added `test_evaluate_integration_is_deterministic_with_seeds` validating deterministic full-result equality when both policy and environment seeds are fixed.

**Result:**
- **Outcome:** Fully Successful
- **Evidence Captured:**
  ```
  Command: uv run --with pytest pytest tests/test_evaluation.py -v
  Result: 16 passed
  Scope: evaluation unit + integration coverage including real SQLEnvironment flow

  Command: uv run --with pytest pytest tests/ -v
  Result: 116 passed, 1 skipped
  Scope: full project regression after adding integration coverage
  ```
- **Tests run:** `uv run --with pytest pytest tests/test_evaluation.py -v`; `uv run --with pytest pytest tests/ -v`
- **Notes:**
  - Integration tests were implemented in `tests/test_evaluation.py` to match this repository's flat test layout.
  - Verifier gate approved finalization in MVP mode after test evidence review.
  - Reviewer auto-step was skipped by policy because risk tier is Low, tests passed, and no security-sensitive surfaces changed.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A

**Context for Next Step:**
- All implementation steps are complete and verification gate passed.

---

## 8. Rollout Considerations

### Feature Flags
- [x] Required: No
- [ ] Flag name: N/A

### Migration
- [x] Data migration needed: No
- [ ] Migration strategy: N/A

### Rollback Plan
Delete the `evaluation/` directory. No other code references it.

---

## 9. Execution Tracking

All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file

The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history

---

## 9a. Slice Completion Protocol

After all steps in a slice pass verification:

1. **Run verifier subagent** for spec compliance
   - Validates against VERIFICATION_SPEC.md criteria
   - Ensures no TODOs or incomplete work in slice

2. **Run compound-engineer subagent** to extract learnings
   - **Mandatory invocation** after every slice completion
   - Updates CLAUDE.md Learnings section (if durable patterns found)
   - May exit with "no update needed" (valid for routine work)

3. **Commit** the slice changes
   - Follow commit message format in CLAUDE.md
   - Each slice gets its own atomic commit

4. **Continue to next slice** (if more slices remain)
   - Or proceed to final verification if all slices complete

**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.

---

## 10. User Value Summary

<!-- Populated by /autocode-next-step when final step completes -->

**Status:** Generated

### What Users Can Now Do
Run automated evaluation of any policy over N episodes with `evaluate(env, policy, n_episodes=100)` and get structured metrics including success rate, average reward, average steps, and per-episode breakdown.

### How to Access/Test
```python
from evaluation import evaluate, RandomPolicy
from server.sql_environment import SQLEnvironment

# Replace the "..." placeholders with real paths; `tokenizer` must be created beforehand.
env = SQLEnvironment(questions_path="...", db_dir="...", tokenizer=tokenizer)
policy = RandomPolicy(seed=42)
result = evaluate(env, policy, n_episodes=10, seed=42)
print(f"Success rate: {result.success_rate:.1%}")
print(f"Avg reward: {result.avg_reward:.3f}")
```
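For longer runs, the `progress_callback` parameter of `evaluate()` accepts any callable taking `(current, total)`. The sketch below shows one possible callback; `make_progress_printer` and its `every`/`stream` parameters are hypothetical helpers, not part of the `evaluation` package. The loop at the end simulates the calls `evaluate()` would make for a five-episode run.

```python
import io
import sys


def make_progress_printer(every: int = 10, stream=sys.stderr):
    """Build a callback that reports every `every` episodes and at the end."""

    def report(current: int, total: int) -> None:
        if current % every == 0 or current == total:
            print(f"[{current}/{total}] episodes evaluated", file=stream)

    return report


# Simulate the calls made for a 5-episode run, capturing output in memory.
buffer = io.StringIO()
callback = make_progress_printer(every=2, stream=buffer)
for i in range(5):
    callback(i + 1, 5)
lines = buffer.getvalue().splitlines()
print(lines)
```

In real use the callback would be passed as `evaluate(env, policy, n_episodes=100, progress_callback=make_progress_printer())`, which writes to standard error so that progress lines do not mix with structured output on standard out.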

### Demo
- **Command:** `uv run python -c "from evaluation import evaluate, RandomPolicy; ..."`

### Release Notes Snippet
Added automated evaluation wrapper with built-in random baseline policy for benchmarking agent performance.

---

## 11. PR Contract (Auto-Generated by autocode-next-step)

<!-- This section is auto-populated by autocode-next-step command when all steps complete -->

**Status:** Generated

### PR Title
feat(evaluation): complete green agent wrapper integration and finalization

### PR Summary
- Add deterministic integration coverage for `evaluate()` against a real `SQLEnvironment` fixture.
- Finalize F005 with full regression evidence, verifier approval, and archived behavior documentation.
- Capture durable learnings under `docs/learnings/` for evaluation patterns and deterministic testing.

### Verification
- `uv run --with pytest pytest tests/test_evaluation.py -v`
- `uv run --with pytest pytest tests/ -v`

### Follow-up
All steps completed. PR Created: https://github.com/hjerpe/sql-env/pull/10

---

## Stop Conditions (When to Split This Spec)

Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices

When splitting, ensure the current slice ends in a merged, stable state.

---

## Human Checkpoint

**Before handing to AI agent:**

- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated

**Questions:**
1. Any remaining concerns?
2. Anything agent should know?

---

## Handoff Notes

**For the implementing AI agent:**

```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
```

---

*Specification completed: 2026-03-27*
*Approved by: --*
*Verification spec: VERIFICATION_SPEC.md*
*Verification input: [F005-VERIFICATION_INPUT.json](F005-VERIFICATION_INPUT.json)*
*Target agent: Claude Code*