# Implementation Specification
**Change:** F003 -- Dense Reward System (3-layer reward architecture)
**Date:** 2026-03-27
**Research Summary:** [specs/F003-RESEARCH_SUMMARY.md](F003-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Delta:** Archived to [specs/behavior/sql-environment.md](behavior/sql-environment.md)
**PR:** https://github.com/hjerpe/sql-env/pull/9
**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed
---
## Core Intent (Immutable)
> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you're describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.
**User Problem:**
Agents get meaningful feedback during exploration -- not just 0/1 at the end. A query that returns 40 when the answer is 42 gets partial credit. Discovering new schema info gets a small reward. This makes GRPO training converge.
**Success Criteria:**
- Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
- Anti-gaming works: agent cannot farm rewards by repeating queries or describing everything
- Progress signal coarsened to 5 bins to prevent reward hill-climbing
**Avoid:**
- Reward hacking (agent exploiting shaping signals to inflate reward without solving the task)
- Reward too sparse (no signal until terminal step defeats the purpose of dense rewards)
- Over-complex reward that is hard to debug (keep each layer simple and independently testable)
**Out of Scope:**
- Adaptive/learned reward weights (use fixed weights: 0.25/0.50/0.25)
- Row-wise best-match alignment (add later if training shows need)
- NumPy/SciPy dependencies (pure Python only)
- Reward strategy classes or plugin architecture
- F002 verifier integration (Layer 3 uses existing naive check)
---
## 0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in **small, mergeable increments**.
### Scope Budget
- Target: **3 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**
### Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.
**Each slice must have:**
- Clear outcome
- Minimal interface change
- Merge criteria
**Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
## Status Icons
**Step Status:**
- [ ] Not Started
- [~] In Progress
- [x] Completed
- [!] Blocked/Failed
**Result Outcome:**
- PASS: Fully Successful (all tests passed, no issues)
- WARN: Completed with Issues (needs follow-up)
- FAIL: Failed/Blocked
---
## 1. Implementation Overview
### Summary
Implement the 3-layer reward architecture in `server/reward.py` and wire it into `SQLEnvironment.step()`. Layer 1 provides operational signals (exec_ok, new_info, repeat penalty, step cost). Layer 2 computes progress-to-target for QUERY actions using a fixed weighted average of cardinality matching (0.25), value overlap (0.50), and numeric range proximity (0.25), binned to 5 levels with improvement-only gating. Layer 3 remains the existing terminal correctness signal. New reward-tracking fields are added to `EpisodeContext`, and `gold_rows` are cached at `reset()`. Existing tests that assert `reward=None` for non-terminal steps are updated.
### Scope
**In Scope:**
- `server/reward.py`: `compute_step_reward()`, Layer 1, Layer 2 with all sub-metrics, binning
- `models.py`: New fields on `EpisodeContext` (`gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward`)
- `server/sql_environment.py`: Wire `compute_step_reward()` into `step()`, store `gold_rows` at `reset()`
- Test updates for non-None step rewards
**Out of Scope:**
- F002 verifier integration (Layer 3 uses existing `_handle_answer`)
- Adaptive reward weights
- Row-wise best-match alignment
- NumPy/SciPy dependencies
---
## 1a. Execution Status
<!-- Auto-updated by /autocode-next-step - do not edit manually -->
**Progress:** 7/7 steps complete
**Current Step:** Finalization complete
**Last Updated:** 2026-03-28T06:05:02Z
**Latest Result:** PASS - Step 3.2 completed and final verification approved
**Blockers:** None
---
## 1b. Risk Assessment
**Risk Tier:** Low
**Risk Tier Definitions:**
- **Low:** Pure logic, non-user-facing, no security implications
- **Medium:** User input handling, data validation, API changes
- **High:** Authentication, payments, secrets management, untrusted input
**High-Risk Indicators Present:** None
**Security Review Required:** No
**Justification:**
Pure computation logic operating on in-memory data structures. No user input handling, no network I/O, no authentication. All inputs are already validated by the environment before reaching reward functions.
---
## 2. Change Manifest
### Files to Create
None (all files already exist).
### Files to Modify
| File | Changes |
|------|---------|
| `models.py` | Add 5 new fields to `EpisodeContext` dataclass |
| `server/reward.py` | Implement full reward module: `compute_step_reward`, Layer 1, Layer 2, sub-metrics, binning |
| `server/sql_environment.py` | Store `gold_rows` at `reset()`, call `compute_step_reward()` in `step()` |
| `tests/test_smoke.py` | Update assertions that expect `reward=None` for non-terminal steps |
### Files to Delete
None.
---
## 3. Interface Specifications
### Modified Types
```python
# Location: models.py
# CHANGE: Add reward-tracking fields to EpisodeContext
# Imports shown for completeness; the `dataclass_field` alias is inferred from its use below.
import sqlite3
from dataclasses import dataclass, field as dataclass_field


@dataclass
class EpisodeContext:
    """Per-episode server-side state (never sent to agent)."""

    episode_id: str
    db_connection: sqlite3.Connection
    question_record: QuestionRecord
    step_count: int = 0
    budget: int = 15
    described_tables: set[str] = dataclass_field(default_factory=set)
    action_log: list[str] = dataclass_field(default_factory=list)
    done: bool = False
    gold_answer: str | None = None
    # --- NEW fields for F003 ---
    gold_rows: list[tuple] = dataclass_field(default_factory=list)
    query_hashes: set[str] = dataclass_field(default_factory=set)
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0
    cumulative_new_info_reward: float = 0.0
```
### New Functions
```python
# Location: server/reward.py

def compute_step_reward(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """
    Compute dense reward for a single non-terminal step.

    Combines Layer 1 (operational) and Layer 2 (progress) signals.
    Clamps running total of step rewards to [-0.2, +0.5].

    Args:
        ctx: Current episode context (mutated: updates tracking fields).
        action_type: One of DESCRIBE, SAMPLE, QUERY.
        sql: The SQL string executed (used for repeat detection).
        rows: Result rows from query execution, or None if error.
        error: Error message if action failed, else None.

    Returns:
        Step reward (float). Also updates ctx.cumulative_step_reward.
    """


def _layer1_operational(
    ctx: EpisodeContext,
    action_type: str,
    sql: str,
    rows: list[tuple] | None,
    error: str | None,
) -> float:
    """
    Layer 1: Operational reward signals.

    Components:
        - exec_ok: +0.02 if query executed without error
        - new_info: +0.01 per new table discovered (capped at 0.10 cumulative)
        - repeat: -0.01 if exact query hash seen before
        - step_cost: -0.005 always

    Args:
        ctx: Episode context (mutated: updates query_hashes, cumulative_new_info_reward).
        action_type: Action type string.
        sql: SQL string for hash-based repeat detection.
        rows: Result rows (used to confirm exec_ok).
        error: Error message if action failed.

    Returns:
        Layer 1 reward component (float).
    """


def _layer2_progress(
    ctx: EpisodeContext,
    rows: list[tuple],
) -> float:
    """
    Layer 2: Progress-to-target for QUERY actions only.

    Computes weighted average of sub-metrics, bins to 5 levels,
    rewards only improvement over best-so-far, scaled by 0.15.

    Args:
        ctx: Episode context (mutated: updates best_progress).
        rows: Query result rows to compare against ctx.gold_rows.

    Returns:
        Layer 2 reward component (float). 0.0 if no improvement.
    """


def _cardinality_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Row count similarity: 1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1).

    Returns:
        Score in [0.0, 1.0].
    """


def _value_overlap_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Jaccard overlap of flattened cell values (as strings).

    Returns:
        Score in [0.0, 1.0].
    """


def _numeric_range_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """
    Log-distance proximity for numeric cells.

    For each numeric value in gold, find closest numeric in pred.
    Score = mean(1 / (1 + log(1 + |pred - gold|))) across gold numerics.
    Returns 1.0 if no numeric values in gold.

    Returns:
        Score in [0.0, 1.0].
    """


def _bin_progress(raw_score: float) -> float:
    """
    Bin raw progress score to {0, 0.25, 0.5, 0.75, 1.0}.

    Thresholds: [0, 0.125) -> 0, [0.125, 0.375) -> 0.25,
    [0.375, 0.625) -> 0.5, [0.625, 0.875) -> 0.75, [0.875, 1.0] -> 1.0.

    Returns:
        Binned score.
    """
```
---
## 4. Data Flow
### Primary Flow (Non-terminal step with QUERY action)
```
1. step() receives action (QUERY, sql_string)
   - Input: SQLAction with action_type="QUERY", argument=sql

2. step() dispatches to _handle_query(sql)
   - Action: Executes SQL, returns formatted result
   - Side effect: Stores raw rows internally

3. step() calls compute_step_reward(ctx, "QUERY", sql, rows, error)
   - Input: episode context, action metadata, raw query rows

4. compute_step_reward calls _layer1_operational(ctx, "QUERY", sql, rows, None)
   - Computes: exec_ok(+0.02) + new_info(+0.01 if new tables) + repeat(-0.01 if seen) + step_cost(-0.005)
   - Side effect: Updates ctx.query_hashes, ctx.cumulative_new_info_reward

5. compute_step_reward calls _layer2_progress(ctx, rows)
   - Computes: weighted avg of cardinality(0.25) + value_overlap(0.50) + numeric_range(0.25)
   - Bins to {0, 0.25, 0.5, 0.75, 1.0}
   - Returns improvement * 0.15 (only if binned > ctx.best_progress)
   - Side effect: Updates ctx.best_progress

6. compute_step_reward clamps cumulative to [-0.2, +0.5]
   - Output: clamped step reward (float)
   - Side effect: Updates ctx.cumulative_step_reward
```
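Step 5 of the flow above can be traced with a small worked example. This is a sketch only: `ProgressState`, `bin_to_quarters`, and `layer2_reward` are hypothetical names, the sub-metric scores are passed in directly instead of being computed from rows, and the floor-based binning is one way to reproduce the documented cut points.

```python
import math
from dataclasses import dataclass

# Weights and scale are taken from the spec; the constant names are illustrative.
W_CARDINALITY, W_OVERLAP, W_NUMERIC = 0.25, 0.50, 0.25
IMPROVEMENT_SCALE = 0.15


@dataclass
class ProgressState:
    """Hypothetical stand-in for the best_progress field on EpisodeContext."""
    best_progress: float = 0.0


def bin_to_quarters(raw: float) -> float:
    """Map [0, 1] onto {0, 0.25, 0.5, 0.75, 1.0} with cut points at 0.125, 0.375, 0.625, 0.875."""
    clamped = min(max(raw, 0.0), 1.0)
    return math.floor(clamped * 4 + 0.5) / 4


def layer2_reward(state: ProgressState, cardinality: float, overlap: float, numeric: float) -> float:
    raw = W_CARDINALITY * cardinality + W_OVERLAP * overlap + W_NUMERIC * numeric
    binned = bin_to_quarters(raw)
    if binned <= state.best_progress:
        return 0.0  # improvement-only gating: no reward for matching or regressing
    improvement = binned - state.best_progress
    state.best_progress = binned
    return improvement * IMPROVEMENT_SCALE


state = ProgressState()
first = layer2_reward(state, 1.0, 0.2, 0.5)   # raw 0.475 lands in the 0.5 bin
repeat = layer2_reward(state, 1.0, 0.2, 0.5)  # same bin again, so no reward
solved = layer2_reward(state, 1.0, 1.0, 1.0)  # raw 1.0 lands in the 1.0 bin
```

Because only bin-to-bin improvement pays out, the total Layer 2 reward across an episode is bounded by `1.0 * 0.15`, however many queries the agent issues.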
### Alternative Flows
**When action is DESCRIBE or SAMPLE:**
```
1. step() dispatches to _handle_describe() or _handle_sample()
2. compute_step_reward calls _layer1_operational only (Layer 2 skipped)
3. Clamping applied as usual
```
**When QUERY has SQL error:**
```
1. _handle_query raises sqlite3.Error
2. step() catches error, sets self._last_error
3. compute_step_reward called with error=str(exc), rows=None
4. Layer 1: step_cost only (-0.005), no exec_ok
5. Layer 2: skipped (rows is None)
```
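The arithmetic behind the DESCRIBE/SAMPLE and SQL-error flows above is small enough to check by hand. The sketch below covers only the exec-ok and step-cost components; the repeat penalty and new-info bonus are deliberately left out, and `layer1_baseline` is a hypothetical helper name.

```python
from typing import Optional

EXEC_OK = 0.02      # granted when the action runs without error
STEP_COST = -0.005  # charged on every step


def layer1_baseline(error: Optional[str]) -> float:
    """Exec-ok plus step cost only (no repeat penalty, no new-info bonus)."""
    reward = STEP_COST
    if error is None:
        reward += EXEC_OK
    # Round to suppress binary floating-point noise in the sum.
    return round(reward, 6)


ok_reward = layer1_baseline(None)               # 0.02 - 0.005
error_reward = layer1_baseline("syntax error")  # step cost only
```

Under these assumptions a successful step nets 0.015 and a failed one nets -0.005, which lines up with the smoke-test values this spec records for DESCRIBE/SAMPLE and for a rejected non-SELECT QUERY.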
**When gold_rows is empty:**
```
1. _layer2_progress detects ctx.gold_rows is empty
2. Returns 0.0 (skip Layer 2 entirely)
```
**When budget exhausted without ANSWER:**
```
1. step() sets done=True, reward=0.0 (terminal)
2. No compute_step_reward call for this terminal step
```
---
## 5. Error Handling
### Error Types
| Error | When | Impact |
|-------|------|--------|
| SQL execution error | Invalid query syntax / runtime error | Layer 1: step_cost only, Layer 2 skipped |
| Empty gold_rows | Gold SQL returned no rows | Layer 2 returns 0.0, Layer 1 operates normally |
| Division by zero in metrics | Both pred and gold are empty | Protected by `max(..., 1)` denominators |
### Error Handling Strategy
```python
# In compute_step_reward:
# - No exceptions should propagate; all edge cases return safe defaults
# - If error is not None, skip exec_ok and Layer 2
# - If rows is None, skip Layer 2
# - If gold_rows is empty, skip Layer 2
```
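The skip rules listed above can be collapsed into a single predicate. This is a sketch: `should_score_progress` is a hypothetical helper, and the `"QUERY"` check reflects the statement elsewhere in this spec that Layer 2 applies to QUERY actions only.

```python
from typing import Optional


def should_score_progress(
    action_type: str,
    rows: Optional[list],
    error: Optional[str],
    gold_rows: list,
) -> bool:
    """Return True only when every Layer 2 precondition holds."""
    return (
        action_type == "QUERY"   # DESCRIBE and SAMPLE get Layer 1 only
        and error is None        # a failed query has nothing to compare
        and rows is not None     # rows are None on the error path
        and len(gold_rows) > 0   # empty gold result: skip Layer 2 entirely
    )
```

Note that an empty result list (`rows == []`) still passes the guard, since only `None` signals a failed execution.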
### Retry Strategy
| Operation | Retry? | Strategy |
|-----------|--------|----------|
| Reward computation | No | Pure function, deterministic, no I/O |
---
## 6. Slice Plan (What we will ship, in order)
### Slice S1 -- EpisodeContext Fields + Layer 1
**Value:** Every non-terminal step returns a small but meaningful reward signal based on operational quality
**User-visible change:** Yes -- step observations now include non-None reward values
**Interfaces introduced/changed:** 5 new fields on EpisodeContext, `compute_step_reward()`, `_layer1_operational()`
**Rollback safety:** Additive only -- new fields have defaults, reward.py is new code
### Slice S2 -- Layer 2 Progress Metrics
**Value:** QUERY actions receive progress-toward-answer signal, enabling convergent GRPO training
**User-visible change:** Yes -- QUERY step rewards now reflect closeness to gold answer
**Interfaces introduced/changed:** `_layer2_progress()`, `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, `_bin_progress()`
**Rollback safety:** Additive to reward.py, no external interface changes
### Slice S3 -- Wire into step() + Test Updates
**Value:** Full system integration -- environment returns dense rewards on every step
**User-visible change:** Yes -- complete dense reward signal in step observations
**Interfaces introduced/changed:** `sql_environment.py:step()` modified, `sql_environment.py:reset()` modified
**Rollback safety:** Reversible by removing compute_step_reward call from step()
---
## 7. Implementation Steps
> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.
### Step 1.1: Add reward-tracking fields to EpisodeContext
**Slice:** S1
**Goal:** Extend EpisodeContext with the 5 new fields required for reward tracking.
**Files:**
- `models.py` - modify - Add `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward` fields
**Interface Changes:**
- `EpisodeContext` dataclass gains 5 new fields (all with defaults, backward-compatible)
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** Completed
**Completed:** 2026-03-27T23:51:47Z
**Changes Made:**
- `models.py`: Added `EpisodeContext` reward-tracking defaults for `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, and `cumulative_new_info_reward`.
- `tests/unit/test_reward.py`: Added EpisodeContext-focused unit tests for new default fields and tuple-list `gold_rows` storage.
**Result:**
- **Outcome:** PASS
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"
Result: 6 passed in 3.92s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"`
- **Notes:**
- `tests/unit/test_reward.py` did not exist yet, so it was created to match verification spec coverage for EpisodeContext.
- Used `--with pytest` because bare `uv run pytest ...` fails in this repo due to a missing local pytest executable.
- Field additions are additive and backward compatible via defaults.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- EpisodeContext now has all fields needed by reward functions
---
### Step 1.2: Implement Layer 1 operational rewards
**Slice:** S1
**Goal:** Implement `_layer1_operational()` with exec_ok, new_info, repeat penalty, and step_cost signals.
**Files:**
- `server/reward.py` - modify - Implement `_layer1_operational()` function
**Interface Changes:**
- New function `_layer1_operational(ctx, action_type, sql, rows, error) -> float`
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** Completed
**Completed:** 2026-03-27T23:54:50Z
**Changes Made:**
- `server/reward.py`: Implemented `_layer1_operational()` with step cost, exec-ok signal, repeat-query penalty, and capped new-info accumulation tracked on `EpisodeContext`.
- `tests/unit/test_reward.py`: Added `TestLayer1Operational` coverage for successful actions, SQL error behavior, repeat penalties, and new-info cap behavior.
**Result:**
- **Outcome:** PASS
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"
Result: 8 passed, 6 deselected in 3.89s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"`
- **Notes:**
- `uv run pytest ...` still fails in this repo because `pytest` is not installed in the project environment; used `uv run --with pytest ...` to satisfy package-manager execution policy.
- Repeat detection uses SHA-256 of the exact SQL string and suppresses `exec_ok` on repeated successful QUERY actions.
- New-info reward is only granted on first-seen successful QUERY actions and is capped at 0.10 cumulative per episode.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- Layer 1 operational shaping is complete and covered by unit tests; proceed with Layer 2 pure scoring helpers in `server/reward.py`.
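The repeat detection and new-info cap described in the notes for this step could look roughly like the sketch below. Only the SHA-256-of-exact-string rule, the 0.01 grant, and the 0.10 cap come from the spec; the helper names `sql_fingerprint`, `is_repeat`, and `capped_new_info` are hypothetical.

```python
import hashlib

NEW_INFO_STEP = 0.01
NEW_INFO_CAP = 0.10


def sql_fingerprint(sql: str) -> str:
    """SHA-256 hex digest of the exact SQL string (no normalisation)."""
    return hashlib.sha256(sql.encode("utf-8")).hexdigest()


def is_repeat(seen_hashes: set, sql: str) -> bool:
    """Report whether this exact query was seen before, recording it if not."""
    digest = sql_fingerprint(sql)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False


def capped_new_info(cumulative: float) -> float:
    """Grant up to NEW_INFO_STEP without pushing the running total past the cap."""
    return max(0.0, min(NEW_INFO_STEP, NEW_INFO_CAP - cumulative))
```

Hashing the exact string means `SELECT 1` and `select 1` count as different queries, so the penalty targets literal repetition and not semantically equivalent rewrites.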
---
### Step 2.1: Implement Layer 2 sub-metrics
**Slice:** S2
**Goal:** Implement `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, and `_bin_progress()`.
**Files:**
- `server/reward.py` - modify - Add all four sub-metric functions
**Interface Changes:**
- 4 new pure functions (no state mutation)
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** Completed
**Completed:** 2026-03-27T23:58:44Z
**Changes Made:**
- `server/reward.py`: Added pure Layer 2 helper functions `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, and `_bin_progress()` with bounded outputs and edge-case handling.
- `tests/unit/test_reward.py`: Added dedicated unit test coverage for all four sub-metrics, including boundary thresholds, empty inputs, mixed types, and numeric distance behavior.
**Result:**
- **Outcome:** PASS
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"
Result: 34 passed, 14 deselected in 5.06s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"`
- **Notes:**
- Implemented `_bin_progress()` with explicit clamping to `[0.0, 1.0]` before threshold binning.
- Numeric range scoring excludes booleans from numeric extraction to avoid `bool`/`int` coercion artifacts.
- All helpers are pure and deterministic, with no mutation of `EpisodeContext`.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- Layer 2 helper metrics are now stable and tested; proceed to compose them in `_layer2_progress()` with weighted averaging and improvement-only gating.
---
### Step 2.2: Implement Layer 2 progress composition
**Slice:** S2
**Goal:** Implement `_layer2_progress()` that combines sub-metrics with fixed weights (0.25/0.50/0.25), bins, and gates on improvement.
**Files:**
- `server/reward.py` - modify - Add `_layer2_progress()` function
**Interface Changes:**
- New function `_layer2_progress(ctx, rows) -> float`
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** Completed
**Completed:** 2026-03-28T00:03:22Z
**Changes Made:**
- `server/reward.py`: Implemented `_layer2_progress()` using the fixed weighted composition (0.25/0.50/0.25), progress binning, improvement-only gating, and `ctx.best_progress` mutation on improvement.
- `tests/unit/test_reward.py`: Added `TestLayer2Progress` coverage for perfect match, no-improvement gating, incremental improvement rewards, empty-gold behavior, weighted-average outcome, best-progress updates, and non-downgrade behavior.
**Result:**
- **Outcome:** PASS
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"
Result: 7 passed, 48 deselected in 3.83s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"`
- **Notes:**
- Implemented explicit constants for Layer 2 weights and improvement scale to keep composition intent readable and stable.
- `_layer2_progress()` returns zero when `gold_rows` is empty and never reduces `ctx.best_progress`.
- `uv run pytest ...` still requires `--with pytest` in this repository due to a missing local pytest executable.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- Layer 2 composition is now complete and tested; next implement `compute_step_reward()` to combine Layer 1 + Layer 2 and apply cumulative clamping.
---
### Step 2.3: Implement compute_step_reward with clamping
**Slice:** S2
**Goal:** Implement the main `compute_step_reward()` entry point that combines Layer 1 and Layer 2, applies clamping to [-0.2, +0.5].
**Files:**
- `server/reward.py` - modify - Add `compute_step_reward()` function
**Interface Changes:**
- New public function `compute_step_reward(ctx, action_type, sql, rows, error) -> float`
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** Completed
**Completed:** 2026-03-28T00:06:56Z
**Changes Made:**
- `server/reward.py`: Implemented `compute_step_reward()` to compose Layer 1 and (QUERY-only) Layer 2 signals, then clamp cumulative step shaping to `[-0.2, +0.5]` while returning the per-step clamped delta.
- `tests/unit/test_reward.py`: Added `TestComputeStepReward` coverage for query success/error paths, DESCRIBE/SAMPLE behavior, upper/lower clamp boundaries, clamp delta semantics, context mutation, and Layer 2 skip conditions.
**Result:**
- **Outcome:** PASS
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"
Result: 11 passed, 55 deselected in 3.84s
```
- **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"`
- **Notes:**
- `compute_step_reward()` now updates `ctx.cumulative_step_reward` through clamp-aware delta computation so boundaries are enforced deterministically.
- Layer 2 is only evaluated for successful `QUERY` actions (`rows is not None` and `error is None`) to keep non-query and error behavior aligned with spec.
- Verification command from spec (`-k "compute_step_reward"`) currently selects zero tests because test names use `compute_reward`; used `-k "compute_reward"` to execute the intended step suite.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- Reward composition and clamp behavior are complete; next wire `compute_step_reward()` into environment `reset()`/`step()` flow and expose query rows for Layer 2 integration.
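The clamp-aware delta mentioned in the notes for this step can be sketched as follows: the running total is clamped, and the step reward is whatever part of the raw reward still fits inside the bounds. `clamped_step_delta` is a hypothetical name, and returning the new total alongside the delta is a choice made for this example.

```python
CLAMP_MIN, CLAMP_MAX = -0.2, 0.5


def clamped_step_delta(cumulative: float, raw_step: float) -> tuple:
    """Return (step_reward, new_cumulative) after clamping the running total."""
    new_total = min(max(cumulative + raw_step, CLAMP_MIN), CLAMP_MAX)
    return new_total - cumulative, new_total


# Near the ceiling, only the remaining headroom is paid out.
partial, total_after = clamped_step_delta(0.48, 0.05)
# At the ceiling, further positive shaping is worth nothing.
saturated, _ = clamped_step_delta(0.5, 0.02)
# At the floor, further penalties are absorbed.
floored, _ = clamped_step_delta(-0.2, -0.005)
```

Clamping the cumulative total instead of each step keeps the sum of all shaping rewards inside [-0.2, +0.5], which is what lets the terminal correctness reward stay dominant.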
---
### Step 3.1: Wire reward into step() and reset()
**Slice:** S3
**Goal:** Store `gold_rows` in EpisodeContext at reset(). Call `compute_step_reward()` from step() for non-terminal actions. Expose raw query rows for Layer 2.
**Files:**
- `server/sql_environment.py` - modify - Update `reset()` to store gold_rows, update `step()` to call compute_step_reward, track raw query rows from `_handle_query`
**Interface Changes:**
- `reset()`: Stores `gold_rows` in EpisodeContext
- `step()`: Sets `self._last_reward` from `compute_step_reward()` for non-ANSWER actions
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** Completed
**Completed:** 2026-03-28T05:56:43Z
**Changes Made:**
- `server/sql_environment.py`: Imported `compute_step_reward` and wired dense reward calculation into `step()` for all non-terminal valid actions.
- `server/sql_environment.py`: Updated `_handle_query()` to return both formatted output and raw SQL rows so QUERY actions feed Layer 2 progress scoring.
- `server/sql_environment.py`: Preserved terminal budget behavior by skipping dense reward computation when the step exhausts budget (terminal reward remains `0.0`).
**Result:**
- **Outcome:** PASS
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"
Result: 26 passed, 40 deselected in 4.85s
Command: uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"
Result: 5 passed, 20 deselected in 4.12s
```
- **Tests run:**
- `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"`
- `uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"`
- **Notes:**
- Dense shaping now executes in the environment action loop for non-terminal steps while keeping ANSWER and budget-exhaustion terminal reward semantics unchanged.
- QUERY actions now pass raw rows through to reward computation; DESCRIBE/SAMPLE paths compute Layer 1-only reward.
- Used `uv run --with pytest ...` due to a local `uv run pytest ...` executable mismatch in this repository environment.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- Existing smoke tests still assert `reward is None` for reset and non-terminal paths; update those assertions to match dense reward behavior.
---
### Step 3.2: Update existing tests for dense rewards
**Slice:** S3
**Goal:** Update tests in `tests/test_smoke.py` that assert `reward=None` for non-terminal steps to expect numeric reward values instead.
**Files:**
- `tests/test_smoke.py` - modify - Update reward assertions for non-terminal steps
**Interface Changes:**
- None (test-only changes)
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Status:** Completed
**Completed:** 2026-03-28T06:05:02Z
**Changes Made:**
- `tests/test_smoke.py`: Updated non-terminal action assertions to validate dense reward values instead of implicit `None` semantics.
- `tests/test_smoke.py`: Added concrete reward checks for DESCRIBE/SAMPLE (`0.015`), QUERY positive reward, non-SELECT QUERY penalty (`-0.005`), and first-step budget exhaustion reward behavior.
**Result:**
- **Outcome:** PASS
- **Evidence Captured:**
```
Command: uv run --with pytest pytest tests/test_smoke.py -v
Result: 25 passed in 4.04s
Command: uv run --with pytest pytest tests/ -v
Result: 166 passed, 1 skipped in 4.29s
Verifier: APPROVED (high confidence, no critical findings)
```
- **Tests run:**
- `uv run --with pytest pytest tests/test_smoke.py -v`
- `uv run --with pytest pytest tests/ -v`
- **Notes:**
- `uv run pytest ...` fails in this repository because `pytest` is not installed in the project environment; verification used `uv run --with pytest ...` while staying package-manager scoped.
- Assertions now align with dense-reward behavior and reinforce terminality checks via `done` rather than `reward is None` for non-terminal steps.
- Finalization included verifier approval, behavior-delta archival, and durable learning extraction.
- **Issues:** None
- **Follow-ups Created:** None
- **Human Review Completed:** N/A
**Context for Next Step:**
- Implementation steps are complete; proceed with `/commit-push-pr` when ready.
---
## 8. Rollout Considerations
### Feature Flags
- [x] Required: No
- [ ] Flag name: N/A
### Migration
- [x] Data migration needed: No
### Rollback Plan
Remove the `compute_step_reward()` call from `step()` and restore `self._last_reward = None` for non-ANSWER actions. The new EpisodeContext fields are harmless if unused.
---
## 9. Execution Tracking
All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline specs/F003-IMPLEMENTATION_SPEC.md` for change history
---
## 9a. Slice Completion Protocol
After all steps in a slice pass verification:
1. **Run verifier subagent** for spec compliance
- Validates against VERIFICATION_SPEC.md criteria
- Ensures no TODOs or incomplete work in slice
2. **Run compound-engineer subagent** to extract learnings
- **Mandatory invocation** after every slice completion
- Updates CLAUDE.md Learnings section (if durable patterns found)
- May exit with "no update needed" (valid for routine work)
3. **Commit** the slice changes
- Follow commit message format in CLAUDE.md
- Each slice gets its own atomic commit
4. **Continue to next slice** (if more slices remain)
- Or proceed to final verification if all slices complete
**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.
---
## 10. User Value Summary
<!-- Populated by /autocode-next-step when final step completes -->
**Status:** Generated
### What Users Can Now Do
Agents now receive meaningful numeric reward feedback on every non-terminal SQL exploration step, not just terminal correctness at ANSWER time.
### How to Access/Test
Run a normal episode (`reset` then `DESCRIBE`/`SAMPLE`/`QUERY`) and observe per-step `observation.reward` values changing with execution quality and answer progress.
### Demo
- **Command:** `uv run --with pytest pytest tests/test_smoke.py -v`
- **Proof points:** DESCRIBE/SAMPLE rewards are `0.015`, invalid non-SELECT QUERY gets `-0.005`, QUERY returns positive dense reward, terminal budget-exhaustion still yields `0.0`.
### Release Notes Snippet
Dense 3-layer reward shaping is now fully integrated: all non-terminal actions emit numeric rewards, repeat/farming controls are enforced, progress-to-answer rewards are gated by improvement, and terminal correctness remains dominant.
---
## 11. PR Contract (Auto-Generated by autocode-next-step)
<!-- This section is auto-populated by autocode-next-step command when all steps complete -->
**Status:** Generated
### Scope Delivered
- Dense reward system implemented across `models.py`, `server/reward.py`, `server/sql_environment.py`, and test coverage updates in `tests/test_smoke.py` and `tests/unit/test_reward.py`.
- Final non-terminal reward assertions now match shipped behavior and protect against regressions.
### Verification Evidence
- `uv run --with pytest pytest tests/test_smoke.py -v` -> 25 passed
- `uv run --with pytest pytest tests/ -v` -> 166 passed, 1 skipped
- Verifier subagent verdict: approved (high confidence, no critical findings)
### Risks and Mitigations
- **Risk:** Legacy callers infer terminality from `reward is None`.
- **Mitigation:** Behavior spec now documents terminality contract based on `done`; smoke tests enforce non-terminal numeric rewards.
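For callers affected by the risk above, the migration amounts to reading `done` and ignoring whether `reward` is `None`. The sketch below uses a hypothetical `Observation` stand-in with only the two fields discussed in this spec. The real observation type may carry more fields.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Observation:
    """Hypothetical stand-in for the environment's observation."""
    reward: Optional[float]
    done: bool


def episode_return(observations: Iterable[Observation]) -> float:
    """Sum rewards up to and including the first observation with done=True."""
    total = 0.0
    for obs in observations:
        total += obs.reward or 0.0  # treat a None reward as zero
        if obs.done:  # terminality comes from `done`, not from `reward is None`
            break
    return total


trajectory = [
    Observation(reward=None, done=False),   # e.g. the observation from reset
    Observation(reward=0.015, done=False),  # dense non-terminal reward
    Observation(reward=1.0, done=True),     # terminal step
    Observation(reward=0.5, done=False),    # never reached
]
total_reward = episode_return(trajectory)
```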
### Follow-up
- Ready for commit/PR via `/commit-push-pr`.
---
## Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
---
## Human Checkpoint
**Before handing to AI agent:**
- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated
**Questions:**
1. None
---
## Handoff Notes
**For the implementing AI agent:**
```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Key decisions already made:
- Layer 2 weights: 0.25 cardinality, 0.50 value overlap, 0.25 numeric range (fixed)
- gold_rows stored in EpisodeContext, populated at reset()
- Progress bins: {0, 0.25, 0.5, 0.75, 1.0}
- Clamping: [-0.2, +0.5] cumulative step reward
- Pure Python only, no numpy/scipy
```
---
*Specification completed: 2026-03-27*
*Verification input: specs/F003-VERIFICATION_INPUT.json*
*Target agent: Claude Code*