	# Implementation Specification

	Change: F003 -- Dense Reward System (3-layer reward architecture)
	Date: 2026-03-27
	Research Summary: [specs/F003-RESEARCH_SUMMARY.md](F003-RESEARCH_SUMMARY.md)
	Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
	Behavior Delta: Archived to [specs/behavior/sql-environment.md](behavior/sql-environment.md)
	PR: https://github.com/hjerpe/sql-env/pull/9

	Plan Status:
	- [x] Draft
	- [x] Approved for Implementation
	- [x] Implementation Complete
	- [x] Verification Passed

	---

	## Core Intent (Immutable)

	> DO NOT MODIFY THIS SECTION DURING REFINEMENT
	> Changes to Core Intent mean you're describing a different feature.
	> If refinement reveals the need to change this section, create a new feature instead.

	User Problem:
	Agents get meaningful feedback during exploration -- not just 0/1 at the end. A query that returns 40 when the answer is 42 gets partial credit. Discovering new schema info gets a small reward. This makes GRPO training converge.

	Success Criteria:
	- Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
	- Anti-gaming works: agent cannot farm rewards by repeating queries or describing everything
	- Progress signal coarsened to 5 bins to prevent reward hill-climbing

	Avoid:
	- Reward hacking (agent exploiting shaping signals to inflate reward without solving the task)
	- Reward too sparse (no signal until terminal step defeats the purpose of dense rewards)
	- Over-complex reward that is hard to debug (keep each layer simple and independently testable)

	Out of Scope:
	- Adaptive/learned reward weights (use fixed weights: 0.25/0.50/0.25)
	- Row-wise best-match alignment (add later if training shows need)
	- NumPy/SciPy dependencies (pure Python only)
	- Reward strategy classes or plugin architecture
	- F002 verifier integration (Layer 3 uses existing naive check)

	---

	## 0. Slicing & Scope Budget (Anti-Waterfall)

	This spec must be executable in small, mergeable increments.

	### Scope Budget
	- Target: 3 slices
	- Hard max: <= 10 steps total
	- Each step must end in: implement -> verify -> merge

	### Slice Definition
	A slice is a vertical increment that delivers user-visible value or a safe internal capability.

	Each slice must have:
	- Clear outcome
	- Minimal interface change
	- Merge criteria

	Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).

	## Status Icons

	Step Status:
	- [ ] Not Started
	- [~] In Progress
	- [x] Completed
	- [!] Blocked/Failed

	Result Outcome:
	- PASS: Fully Successful (all tests passed, no issues)
	- WARN: Completed with Issues (needs follow-up)
	- FAIL: Failed/Blocked

	---

	## 1. Implementation Overview

	### Summary

	Implement the 3-layer reward architecture in `server/reward.py` and wire it into `SQLEnvironment.step()`. Layer 1 provides operational signals (exec_ok, new_info, repeat penalty, step cost). Layer 2 computes progress-to-target for QUERY actions using a fixed weighted average of cardinality matching (0.25), value overlap (0.50), and numeric range proximity (0.25), binned to 5 levels with improvement-only gating. Layer 3 remains the existing terminal correctness signal. New reward-tracking fields are added to `EpisodeContext`, and `gold_rows` are cached at `reset()`. Existing tests that assert `reward=None` for non-terminal steps are updated.

	### Scope

	In Scope:
	- `server/reward.py`: `compute_step_reward()`, Layer 1, Layer 2 with all sub-metrics, binning
	- `models.py`: New fields on `EpisodeContext` (`gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward`)
	- `server/sql_environment.py`: Wire `compute_step_reward()` into `step()`, store `gold_rows` at `reset()`
	- Test updates for non-None step rewards

	Out of Scope:
	- F002 verifier integration (Layer 3 uses existing `_handle_answer`)
	- Adaptive reward weights
	- Row-wise best-match alignment
	- NumPy/SciPy dependencies

	---

	## 1a. Execution Status
	<!-- Auto-updated by /autocode-next-step - do not edit manually -->

	Progress: 7/7 steps complete
	Current Step: Finalization complete
	Last Updated: 2026-03-28T06:05:02Z
	Latest Result: PASS - Step 3.2 completed and final verification approved
	Blockers: None

	---

	## 1b. Risk Assessment

	Risk Tier: Low

	Risk Tier Definitions:
	- Low: Pure logic, non-user-facing, no security implications
	- Medium: User input handling, data validation, API changes
	- High: Authentication, payments, secrets management, untrusted input

	High-Risk Indicators Present: None

	Security Review Required: No

	Justification:
	Pure computation logic operating on in-memory data structures. No user input handling, no network I/O, no authentication. All inputs are already validated by the environment before reaching reward functions.

	---

	## 2. Change Manifest

	### Files to Create

	None (all files already exist).

	### Files to Modify

	| File | Changes |
	|------|---------|
	| `models.py` | Add 5 new fields to `EpisodeContext` dataclass |
	| `server/reward.py` | Implement full reward module: `compute_step_reward`, Layer 1, Layer 2, sub-metrics, binning |
	| `server/sql_environment.py` | Store `gold_rows` at `reset()`, call `compute_step_reward()` in `step()` |
	| `tests/test_smoke.py` | Update assertions that expect `reward=None` for non-terminal steps |

	### Files to Delete

	None.

	---

	## 3. Interface Specifications

	### Modified Types

	```python
	# Location: models.py
	# CHANGE: Add reward-tracking fields to EpisodeContext

	@dataclass
	class EpisodeContext:
	    """Per-episode server-side state (never sent to agent)."""

	    episode_id: str
	    db_connection: sqlite3.Connection
	    question_record: QuestionRecord
	    step_count: int = 0
	    budget: int = 15
	    described_tables: set[str] = dataclass_field(default_factory=set)
	    action_log: list[str] = dataclass_field(default_factory=list)
	    done: bool = False
	    gold_answer: str | None = None
	    # --- NEW fields for F003 ---
	    gold_rows: list[tuple] = dataclass_field(default_factory=list)
	    query_hashes: set[str] = dataclass_field(default_factory=set)
	    best_progress: float = 0.0
	    cumulative_step_reward: float = 0.0
	    cumulative_new_info_reward: float = 0.0
	```
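The new fields use `default_factory` for their containers, so each episode should get its own `gold_rows` list and `query_hashes` set rather than sharing one across instances. A minimal sketch of that property, using a trimmed stand-in class (`RewardState` is hypothetical and holds only the five F003 fields, not the full `EpisodeContext`):

```python
from dataclasses import dataclass, field


@dataclass
class RewardState:
    """Hypothetical stand-in holding only the five F003 tracking fields."""

    gold_rows: list[tuple] = field(default_factory=list)
    query_hashes: set[str] = field(default_factory=set)
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0
    cumulative_new_info_reward: float = 0.0


# Two episodes created with defaults only.
first = RewardState()
second = RewardState()

# Mutating one episode's containers must not leak into the other.
first.query_hashes.add("abc")
first.gold_rows.append((42,))
```

Because every new field has a default, existing call sites that construct the context without these arguments keep working unchanged.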

	### New Functions

	```python
	# Location: server/reward.py

	def compute_step_reward(
	    ctx: EpisodeContext,
	    action_type: str,
	    sql: str,
	    rows: list[tuple] | None,
	    error: str | None,
	) -> float:
	    """
	    Compute dense reward for a single non-terminal step.

	    Combines Layer 1 (operational) and Layer 2 (progress) signals.
	    Clamps running total of step rewards to [-0.2, +0.5].

	    Args:
	        ctx: Current episode context (mutated: updates tracking fields).
	        action_type: One of DESCRIBE, SAMPLE, QUERY.
	        sql: The SQL string executed (used for repeat detection).
	        rows: Result rows from query execution, or None if error.
	        error: Error message if action failed, else None.

	    Returns:
	        Step reward (float). Also updates ctx.cumulative_step_reward.
	    """


	def _layer1_operational(
	    ctx: EpisodeContext,
	    action_type: str,
	    sql: str,
	    rows: list[tuple] | None,
	    error: str | None,
	) -> float:
	    """
	    Layer 1: Operational reward signals.

	    Components:
	    - exec_ok: +0.02 if query executed without error
	    - new_info: +0.01 per new table discovered (capped at 0.10 cumulative)
	    - repeat: -0.01 if exact query hash seen before
	    - step_cost: -0.005 always

	    Args:
	        ctx: Episode context (mutated: updates query_hashes, cumulative_new_info_reward).
	        action_type: Action type string.
	        sql: SQL string for hash-based repeat detection.
	        rows: Result rows (used to confirm exec_ok).
	        error: Error message if action failed.

	    Returns:
	        Layer 1 reward component (float).
	    """


	def _layer2_progress(
	    ctx: EpisodeContext,
	    rows: list[tuple],
	) -> float:
	    """
	    Layer 2: Progress-to-target for QUERY actions only.

	    Computes weighted average of sub-metrics, bins to 5 levels,
	    rewards only improvement over best-so-far, scaled by 0.15.

	    Args:
	        ctx: Episode context (mutated: updates best_progress).
	        rows: Query result rows to compare against ctx.gold_rows.

	    Returns:
	        Layer 2 reward component (float). 0.0 if no improvement.
	    """


	def _cardinality_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
	    """
	    Row count similarity: 1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1).

	    Returns:
	        Score in [0.0, 1.0].
	    """


	def _value_overlap_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
	    """
	    Jaccard overlap of flattened cell values (as strings).

	    Returns:
	        Score in [0.0, 1.0].
	    """


	def _numeric_range_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
	    """
	    Log-distance proximity for numeric cells.

	    For each numeric value in gold, find closest numeric in pred.
	    Score = mean(1 / (1 + log(1 + |pred - gold|))) across gold numerics.
	    Returns 1.0 if no numeric values in gold.

	    Returns:
	        Score in [0.0, 1.0].
	    """


	def _bin_progress(raw_score: float) -> float:
	    """
	    Bin raw progress score to {0, 0.25, 0.5, 0.75, 1.0}.

	    Thresholds: [0, 0.125) -> 0, [0.125, 0.375) -> 0.25,
	    [0.375, 0.625) -> 0.5, [0.625, 0.875) -> 0.75, [0.875, 1.0] -> 1.0.

	    Returns:
	        Binned score.
	    """
	```
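The docstrings for the cardinality, value-overlap, and binning helpers can be read as the following sketch. It is not the shipped `server/reward.py`: the function names drop the leading underscore to mark them as illustrative, the Jaccard overlap is assumed to be set-based (duplicates ignored), and returning 0.0 when both value sets are empty is an assumption the spec does not settle.

```python
def cardinality_score(pred_rows, gold_rows):
    """1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1)."""
    denom = max(len(pred_rows), len(gold_rows), 1)  # guards division by zero
    return 1.0 - abs(len(pred_rows) - len(gold_rows)) / denom


def value_overlap_score(pred_rows, gold_rows):
    """Set-based Jaccard overlap of flattened cell values, compared as strings."""
    pred = {str(cell) for row in pred_rows for cell in row}
    gold = {str(cell) for row in gold_rows for cell in row}
    union = pred | gold
    if not union:
        return 0.0  # assumption: nothing to compare counts as no overlap
    return len(pred & gold) / len(union)


def bin_progress(raw_score):
    """Snap a raw score to {0, 0.25, 0.5, 0.75, 1.0} using the stated thresholds."""
    clamped = min(1.0, max(0.0, raw_score))
    # Check thresholds from the top bin down; lower bounds are inclusive.
    for threshold, level in ((0.875, 1.0), (0.625, 0.75), (0.375, 0.5), (0.125, 0.25)):
        if clamped >= threshold:
            return level
    return 0.0
```

Note that a prediction of `40` against a gold value of `42` scores zero on value overlap, since the strings differ; any partial credit for near-miss numbers has to come from the numeric range metric.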

	---

	## 4. Data Flow

	### Primary Flow (Non-terminal step with QUERY action)

	```
	1. step() receives action (QUERY, sql_string)
	   - Input: SQLAction with action_type="QUERY", argument=sql

	2. step() dispatches to _handle_query(sql)
	   - Action: Executes SQL, returns formatted result
	   - Side effect: Stores raw rows internally

	3. step() calls compute_step_reward(ctx, "QUERY", sql, rows, error)
	   - Input: episode context, action metadata, raw query rows

	4. compute_step_reward calls _layer1_operational(ctx, "QUERY", sql, rows, None)
	   - Computes: exec_ok(+0.02) + new_info(+0.01 if new tables) + repeat(-0.01 if seen) + step_cost(-0.005)
	   - Side effect: Updates ctx.query_hashes, ctx.cumulative_new_info_reward

	5. compute_step_reward calls _layer2_progress(ctx, rows)
	   - Computes: weighted avg of cardinality(0.25) + value_overlap(0.50) + numeric_range(0.25)
	   - Bins to {0, 0.25, 0.5, 0.75, 1.0}
	   - Returns improvement * 0.15 (only if binned > ctx.best_progress)
	   - Side effect: Updates ctx.best_progress

	6. compute_step_reward clamps cumulative to [-0.2, +0.5]
	   - Output: clamped step reward (float)
	   - Side effect: Updates ctx.cumulative_step_reward
	```
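Step 5 of the flow above (weighting, binning, improvement-only gating) can be sketched as a pure function. This is an illustration under assumptions: `progress_reward` is a hypothetical name, it returns the new best value instead of mutating `ctx.best_progress`, and the rounding expression is a compact equivalent of the five threshold bins rather than the shipped code.

```python
WEIGHTS = (0.25, 0.50, 0.25)  # cardinality, value overlap, numeric range
IMPROVEMENT_SCALE = 0.15


def progress_reward(best_so_far, cardinality, overlap, numeric):
    """Return (layer2_reward, new_best) for one successful QUERY step."""
    raw = WEIGHTS[0] * cardinality + WEIGHTS[1] * overlap + WEIGHTS[2] * numeric
    clamped = min(1.0, max(0.0, raw))
    # Round to the nearest quarter; this matches the thresholds
    # 0.125 / 0.375 / 0.625 / 0.875 with inclusive lower bounds.
    binned = int(clamped * 4 + 0.5) / 4
    if binned <= best_so_far:
        return 0.0, best_so_far  # no improvement: no reward, best unchanged
    return (binned - best_so_far) * IMPROVEMENT_SCALE, binned


# Worked example: the agent returns 40 when the gold answer is 42.
# Row counts match (1.0), strings differ (0.0), numeric proximity is about 0.4765.
reward_1, best_1 = progress_reward(0.0, 1.0, 0.0, 0.4765)
# Re-running an equally good query earns nothing further.
reward_2, best_2 = progress_reward(best_1, 1.0, 0.0, 0.4765)
# A perfect result then pays only for the remaining gap to 1.0.
reward_3, best_3 = progress_reward(best_2, 1.0, 1.0, 1.0)
```

In the first call the raw score is about 0.369, which falls in the `[0.125, 0.375)` bin, so the step earns roughly `0.25 * 0.15 = 0.0375`. Binning is what keeps small raw-score wiggles from paying out.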

	### Alternative Flows

	When action is DESCRIBE or SAMPLE:
	```
	1. step() dispatches to _handle_describe() or _handle_sample()
	2. compute_step_reward calls _layer1_operational only (Layer 2 skipped)
	3. Clamping applied as usual
	```

	When QUERY has SQL error:
	```
	1. _handle_query raises sqlite3.Error
	2. step() catches error, sets self._last_error
	3. compute_step_reward called with error=str(exc), rows=None
	4. Layer 1: step_cost only (-0.005), no exec_ok
	5. Layer 2: skipped (rows is None)
	```

	When gold_rows is empty:
	```
	1. _layer2_progress detects ctx.gold_rows is empty
	2. Returns 0.0 (skip Layer 2 entirely)
	```

	When budget exhausted without ANSWER:
	```
	1. step() sets done=True, reward=0.0 (terminal)
	2. No compute_step_reward call for this terminal step
	```

	---

	## 5. Error Handling

	### Error Types

	| Error | When | Impact |
	|-------|------|--------|
	| SQL execution error | Invalid query syntax / runtime error | Layer 1: step_cost only, Layer 2 skipped |
	| Empty gold_rows | Gold SQL returned no rows | Layer 2 returns 0.0, Layer 1 operates normally |
	| Division by zero in metrics | Both pred and gold are empty | Protected by `max(..., 1)` denominators |

	### Error Handling Strategy

	```python
	# In compute_step_reward:
	# - No exceptions should propagate; all edge cases return safe defaults
	# - If error is not None, skip exec_ok and Layer 2
	# - If rows is None, skip Layer 2
	# - If gold_rows is empty, skip Layer 2
	```
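The skip conditions listed above reduce to one predicate deciding whether Layer 2 runs at all. A minimal sketch, assuming a hypothetical helper name (`should_score_progress`) and that the action type is passed as an upper-case string:

```python
def should_score_progress(action_type, rows, error, gold_rows):
    """True only when Layer 2 has something meaningful to compare."""
    return (
        action_type == "QUERY"    # DESCRIBE/SAMPLE get Layer 1 only
        and error is None         # failed actions skip exec_ok and Layer 2
        and rows is not None      # no result rows, nothing to score
        and len(gold_rows) > 0    # empty gold result: Layer 2 contributes 0.0
    )
```

Keeping the guards in one place means each edge case falls back to a Layer 1-only reward instead of raising.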

	### Retry Strategy

	| Operation | Retry? | Strategy |
	|-----------|--------|----------|
	| Reward computation | No | Pure function, deterministic, no I/O |

	---

	## 6. Slice Plan (What we will ship, in order)

	### Slice S1 -- EpisodeContext Fields + Layer 1
	Value: Every non-terminal step returns a small but meaningful reward signal based on operational quality
	User-visible change: Yes -- step observations now include non-None reward values
	Interfaces introduced/changed: 5 new fields on EpisodeContext, `compute_step_reward()`, `_layer1_operational()`
	Rollback safety: Additive only -- new fields have defaults, reward.py is new code

	### Slice S2 -- Layer 2 Progress Metrics
	Value: QUERY actions receive progress-toward-answer signal, enabling convergent GRPO training
	User-visible change: Yes -- QUERY step rewards now reflect closeness to gold answer
	Interfaces introduced/changed: `_layer2_progress()`, `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, `_bin_progress()`
	Rollback safety: Additive to reward.py, no external interface changes

	### Slice S3 -- Wire into step() + Test Updates
	Value: Full system integration -- environment returns dense rewards on every step
	User-visible change: Yes -- complete dense reward signal in step observations
	Interfaces introduced/changed: `sql_environment.py:step()` modified, `sql_environment.py:reset()` modified
	Rollback safety: Reversible by removing compute_step_reward call from step()

	---

	## 7. Implementation Steps

	> VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md.
	> The verification-planner (separate agent) generated independent test criteria.
	> Run the tests specified there after implementing each step.

	### Step 1.1: Add reward-tracking fields to EpisodeContext
	Slice: S1
	Goal: Extend EpisodeContext with the 5 new fields required for reward tracking.

	Files:
	- `models.py` - modify - Add `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward` fields

	Interface Changes:
	- `EpisodeContext` dataclass gains 5 new fields (all with defaults, backward-compatible)

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	Completed: 2026-03-27T23:51:47Z
	Changes Made:
	- `models.py`: Added `EpisodeContext` reward-tracking defaults for `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, and `cumulative_new_info_reward`.
	- `tests/unit/test_reward.py`: Added EpisodeContext-focused unit tests for new default fields and tuple-list `gold_rows` storage.

	Result:
	- Outcome: PASS
	- Evidence Captured:
	```
	Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"
	Result: 6 passed in 3.92s
	```
	- Tests run: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"`
	- Notes:
	  - `tests/unit/test_reward.py` did not exist yet, so it was created to match verification spec coverage for EpisodeContext.
	  - Used `--with pytest` because bare `uv run pytest ...` fails in this repo due to a missing local pytest executable.
	  - Field additions are additive and backward compatible via defaults.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- EpisodeContext now has all fields needed by reward functions

	---

	### Step 1.2: Implement Layer 1 operational rewards
	Slice: S1
	Goal: Implement `_layer1_operational()` with exec_ok, new_info, repeat penalty, and step_cost signals.

	Files:
	- `server/reward.py` - modify - Implement `_layer1_operational()` function

	Interface Changes:
	- New function `_layer1_operational(ctx, action_type, sql, rows, error) -> float`

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	Completed: 2026-03-27T23:54:50Z
	Changes Made:
	- `server/reward.py`: Implemented `_layer1_operational()` with step cost, exec-ok signal, repeat-query penalty, and capped new-info accumulation tracked on `EpisodeContext`.
	- `tests/unit/test_reward.py`: Added `TestLayer1Operational` coverage for successful actions, SQL error behavior, repeat penalties, and new-info cap behavior.

	Result:
	- Outcome: PASS
	- Evidence Captured:
	```
	Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"
	Result: 8 passed, 6 deselected in 3.89s
	```
	- Tests run: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"`
	- Notes:
	  - `uv run pytest ...` still fails in this repo because `pytest` is not installed in the project environment; used `uv run --with pytest ...` to satisfy package-manager execution policy.
	  - Repeat detection uses SHA-256 of the exact SQL string and suppresses `exec_ok` on repeated successful QUERY actions.
	  - New-info reward is only granted on first-seen successful QUERY actions and is capped at 0.10 cumulative per episode.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Layer 1 operational shaping is complete and covered by unit tests; proceed with Layer 2 pure scoring helpers in `server/reward.py`.
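The Layer 1 rules recorded in this step can be sketched as follows. Several details are assumptions and not confirmed by the spec: `Layer1State` and `layer1_reward` are hypothetical names, repeat detection is applied to QUERY actions only, failed queries are not added to the hash set, and "new info" is simplified to a flat +0.01 per first-seen successful QUERY (the spec does not say how new tables are detected).

```python
import hashlib
from dataclasses import dataclass, field

EXEC_OK = 0.02
NEW_INFO = 0.01
NEW_INFO_CAP = 0.10
REPEAT_PENALTY = 0.01
STEP_COST = 0.005


@dataclass
class Layer1State:
    """Hypothetical subset of EpisodeContext used by Layer 1."""

    query_hashes: set[str] = field(default_factory=set)
    cumulative_new_info_reward: float = 0.0


def layer1_reward(state, action_type, sql, error):
    reward = -STEP_COST                      # charged on every step
    if error is not None:
        return reward                        # failed action: step cost only
    if action_type != "QUERY":
        return reward + EXEC_OK              # DESCRIBE / SAMPLE success
    digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    if digest in state.query_hashes:
        return reward - REPEAT_PENALTY       # repeat: exec_ok is suppressed
    state.query_hashes.add(digest)
    reward += EXEC_OK
    # Grant new-info reward only while the per-episode cap has room left.
    grant = min(NEW_INFO, NEW_INFO_CAP - state.cumulative_new_info_reward)
    if grant > 0:
        state.cumulative_new_info_reward += grant
        reward += grant
    return reward
```

Under these assumptions a successful DESCRIBE nets about `0.02 - 0.005 = 0.015` and a failed query nets `-0.005`, which matches the values asserted in the smoke tests.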

	---

	### Step 2.1: Implement Layer 2 sub-metrics
	Slice: S2
	Goal: Implement `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, and `_bin_progress()`.

	Files:
	- `server/reward.py` - modify - Add all four sub-metric functions

	Interface Changes:
	- 4 new pure functions (no state mutation)

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	Completed: 2026-03-27T23:58:44Z
	Changes Made:
	- `server/reward.py`: Added pure Layer 2 helper functions `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, and `_bin_progress()` with bounded outputs and edge-case handling.
	- `tests/unit/test_reward.py`: Added dedicated unit test coverage for all four sub-metrics, including boundary thresholds, empty inputs, mixed types, and numeric distance behavior.

	Result:
	- Outcome: PASS
	- Evidence Captured:
	```
	Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"
	Result: 34 passed, 14 deselected in 5.06s
	```
	- Tests run: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"`
	- Notes:
	  - Implemented `_bin_progress()` with explicit clamping to `[0.0, 1.0]` before threshold binning.
	  - Numeric range scoring excludes booleans from numeric extraction to avoid `bool`/`int` coercion artifacts.
	  - All helpers are pure and deterministic, with no mutation of `EpisodeContext`.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Layer 2 helper metrics are now stable and tested; proceed to compose them in `_layer2_progress()` with weighted averaging and improvement-only gating.
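The numeric range metric and the boolean exclusion noted in this step can be sketched as below. Function names are illustrative, and returning 0.0 when gold has numeric values but the prediction has none is an assumption; the spec only fixes the case where gold has no numerics.

```python
import math


def numeric_values(rows):
    """Flatten rows and keep int/float cells; bool is excluded explicitly
    because isinstance(True, int) is True in Python."""
    return [
        float(cell)
        for row in rows
        for cell in row
        if isinstance(cell, (int, float)) and not isinstance(cell, bool)
    ]


def numeric_range_score(pred_rows, gold_rows):
    gold_nums = numeric_values(gold_rows)
    if not gold_nums:
        return 1.0  # per the spec: nothing numeric to match
    pred_nums = numeric_values(pred_rows)
    if not pred_nums:
        return 0.0  # assumption: no numeric candidates means no proximity
    total = 0.0
    for gold in gold_nums:
        nearest = min(abs(pred - gold) for pred in pred_nums)
        total += 1.0 / (1.0 + math.log(1.0 + nearest))
    return total / len(gold_nums)
```

For a prediction of `40` against a gold value of `42`, the distance is 2 and the score is `1 / (1 + log(3))`, roughly 0.48. The logarithm keeps the score from collapsing to zero for answers that are off by a modest amount.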

	---

	### Step 2.2: Implement Layer 2 progress composition
	Slice: S2
	Goal: Implement `_layer2_progress()` that combines sub-metrics with fixed weights (0.25/0.50/0.25), bins, and gates on improvement.

	Files:
	- `server/reward.py` - modify - Add `_layer2_progress()` function

	Interface Changes:
	- New function `_layer2_progress(ctx, rows) -> float`

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	Completed: 2026-03-28T00:03:22Z
	Changes Made:
	- `server/reward.py`: Implemented `_layer2_progress()` using the fixed weighted composition (0.25/0.50/0.25), progress binning, improvement-only gating, and `ctx.best_progress` mutation on improvement.
	- `tests/unit/test_reward.py`: Added `TestLayer2Progress` coverage for perfect match, no-improvement gating, incremental improvement rewards, empty-gold behavior, weighted-average outcome, best-progress updates, and non-downgrade behavior.

	Result:
	- Outcome: PASS
	- Evidence Captured:
	```
	Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"
	Result: 7 passed, 48 deselected in 3.83s
	```
	- Tests run: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"`
	- Notes:
	  - Implemented explicit constants for Layer 2 weights and improvement scale to keep composition intent readable and stable.
	  - `_layer2_progress()` returns zero when `gold_rows` is empty and never reduces `ctx.best_progress`.
	  - `uv run pytest ...` still requires `--with pytest` in this repository due to a missing local pytest executable.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Layer 2 composition is now complete and tested; next implement `compute_step_reward()` to combine Layer 1 + Layer 2 and apply cumulative clamping.

	---

	### Step 2.3: Implement compute_step_reward with clamping
	Slice: S2
	Goal: Implement the main `compute_step_reward()` entry point that combines Layer 1 and Layer 2, applies clamping to [-0.2, +0.5].

	Files:
	- `server/reward.py` - modify - Add `compute_step_reward()` function

	Interface Changes:
	- New public function `compute_step_reward(ctx, action_type, sql, rows, error) -> float`

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	Completed: 2026-03-28T00:06:56Z
	Changes Made:
	- `server/reward.py`: Implemented `compute_step_reward()` to compose Layer 1 and (QUERY-only) Layer 2 signals, then clamp cumulative step shaping to `[-0.2, +0.5]` while returning the per-step clamped delta.
	- `tests/unit/test_reward.py`: Added `TestComputeStepReward` coverage for query success/error paths, DESCRIBE/SAMPLE behavior, upper/lower clamp boundaries, clamp delta semantics, context mutation, and Layer 2 skip conditions.

	Result:
	- Outcome: PASS
	- Evidence Captured:
	```
	Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"
	Result: 11 passed, 55 deselected in 3.84s
	```
	- Tests run: `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"`
	- Notes:
	  - `compute_step_reward()` now updates `ctx.cumulative_step_reward` through clamp-aware delta computation so boundaries are enforced deterministically.
	  - Layer 2 is only evaluated for successful `QUERY` actions (`rows is not None` and `error is None`) to keep non-query and error behavior aligned with spec.
	  - Verification command from spec (`-k "compute_step_reward"`) currently selects zero tests because test names use `compute_reward`; used `-k "compute_reward"` to execute the intended step suite.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Reward composition and clamp behavior are complete; next wire `compute_step_reward()` into environment `reset()`/`step()` flow and expose query rows for Layer 2 integration.
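The "clamp-aware delta" described in this step can be read as: clamp the running total, then return only the part of the step reward that fits inside the bounds. A minimal sketch under that reading (`apply_clamped_delta` is a hypothetical helper that returns the new total instead of mutating `ctx.cumulative_step_reward`):

```python
CLAMP_LOW, CLAMP_HIGH = -0.2, 0.5


def apply_clamped_delta(cumulative, raw_step_reward):
    """Return (step_reward_actually_paid, new_cumulative_total)."""
    target = min(CLAMP_HIGH, max(CLAMP_LOW, cumulative + raw_step_reward))
    return target - cumulative, target


# Near the ceiling, only the remaining headroom is paid out.
paid_near_cap, total_near_cap = apply_clamped_delta(0.48, 0.05)
# At the ceiling, further positive shaping pays nothing.
paid_at_cap, total_at_cap = apply_clamped_delta(0.5, 0.05)
# Near the floor, penalties are truncated the same way.
paid_near_floor, total_near_floor = apply_clamped_delta(-0.195, -0.015)
```

Returning the delta instead of the raw step reward keeps the sum of emitted step rewards equal to the clamped cumulative total, so shaping cannot exceed `+0.5` or fall below `-0.2` over an episode.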

	---

	### Step 3.1: Wire reward into step() and reset()
	Slice: S3
	Goal: Store `gold_rows` in EpisodeContext at reset(). Call `compute_step_reward()` from step() for non-terminal actions. Expose raw query rows for Layer 2.

	Files:
	- `server/sql_environment.py` - modify - Update `reset()` to store gold_rows, update `step()` to call compute_step_reward, track raw query rows from `_handle_query`

	Interface Changes:
	- `reset()`: Stores `gold_rows` in EpisodeContext
	- `step()`: Sets `self._last_reward` from `compute_step_reward()` for non-ANSWER actions

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	Completed: 2026-03-28T05:56:43Z
	Changes Made:
	- `server/sql_environment.py`: Imported `compute_step_reward` and wired dense reward calculation into `step()` for all non-terminal valid actions.
	- `server/sql_environment.py`: Updated `_handle_query()` to return both formatted output and raw SQL rows so QUERY actions feed Layer 2 progress scoring.
	- `server/sql_environment.py`: Preserved terminal budget behavior by skipping dense reward computation when the step exhausts budget (terminal reward remains `0.0`).

	Result:
	- Outcome: PASS
	- Evidence Captured:
	```
	Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"
	Result: 26 passed, 40 deselected in 4.85s

	Command: uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"
	Result: 5 passed, 20 deselected in 4.12s
	```
	- Tests run:
	  - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"`
	  - `uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"`
	- Notes:
	  - Dense shaping now executes in the environment action loop for non-terminal steps while keeping ANSWER and budget-exhaustion terminal reward semantics unchanged.
	  - QUERY actions now pass raw rows through to reward computation; DESCRIBE/SAMPLE paths compute Layer 1-only reward.
	  - Used `uv run --with pytest ...` due to a local `uv run pytest ...` executable mismatch in this repository environment.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Existing smoke tests still assert `reward is None` for reset and non-terminal paths; update those assertions to match dense reward behavior.

	---

	### Step 3.2: Update existing tests for dense rewards
	Slice: S3
	Goal: Update tests in `tests/test_smoke.py` that assert `reward=None` for non-terminal steps to expect numeric reward values instead.

	Files:
	- `tests/test_smoke.py` - modify - Update reward assertions for non-terminal steps

	Interface Changes:
	- None (test-only changes)

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Status: Completed

	Completed: 2026-03-28T06:05:02Z
	Changes Made:
	- `tests/test_smoke.py`: Updated non-terminal action assertions to validate dense reward values instead of implicit `None` semantics.
	- `tests/test_smoke.py`: Added concrete reward checks for DESCRIBE/SAMPLE (`0.015`), QUERY positive reward, non-SELECT QUERY penalty (`-0.005`), and first-step budget exhaustion reward behavior.

	Result:
	- Outcome: PASS
	- Evidence Captured:
	```
	Command: uv run --with pytest pytest tests/test_smoke.py -v
	Result: 25 passed in 4.04s

	Command: uv run --with pytest pytest tests/ -v
	Result: 166 passed, 1 skipped in 4.29s

	Verifier: APPROVED (high confidence, no critical findings)
	```
	- Tests run:
	  - `uv run --with pytest pytest tests/test_smoke.py -v`
	  - `uv run --with pytest pytest tests/ -v`
	- Notes:
	  - `uv run pytest ...` fails in this repository because `pytest` is not installed in the project environment; verification used `uv run --with pytest ...` while staying package-manager scoped.
	  - Assertions now align with dense-reward behavior and reinforce terminality checks via `done` rather than `reward is None` for non-terminal steps.
	  - Finalization included verifier approval, behavior-delta archival, and durable learning extraction.
	- Issues: None
	- Follow-ups Created: None
	- Human Review Completed: N/A

	Context for Next Step:
	- Implementation steps are complete; proceed with `/commit-push-pr` when ready.

	---

	## 8. Rollout Considerations

	### Feature Flags
	- [x] Required: No
	- [ ] Flag name: N/A

	### Migration
	- [x] Data migration needed: No

	### Rollback Plan
	Remove the `compute_step_reward()` call from `step()` and restore `self._last_reward = None` for non-ANSWER actions. The new EpisodeContext fields are harmless if unused.

	---

	## 9. Execution Tracking

	All execution state is tracked within this document:
	- Section 1a: Overall progress summary
	- Section 7: Per-step completion details, test results, and handoff context
	- FEATURES.json: Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
	- Git history: Full audit trail of changes to this file

	The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
	- Checking Section 1a for summary
	- Reviewing Section 7 for detailed step status
	- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
	- Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history

	---

	## 9a. Slice Completion Protocol

	After all steps in a slice pass verification:

	1. Run verifier subagent for spec compliance
	   - Validates against VERIFICATION_SPEC.md criteria
	   - Ensures no TODOs or incomplete work in slice

	2. Run compound-engineer subagent to extract learnings
	   - Mandatory invocation after every slice completion
	   - Updates CLAUDE.md Learnings section (if durable patterns found)
	   - May exit with "no update needed" (valid for routine work)

	3. Commit the slice changes
	   - Follow commit message format in CLAUDE.md
	   - Each slice gets its own atomic commit

	4. Continue to next slice (if more slices remain)
	   - Or proceed to final verification if all slices complete

	Note: PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.

	---

	## 10. User Value Summary

	<!-- Populated by /autocode-next-step when final step completes -->

	Status: Generated

	### What Users Can Now Do
	Agents now receive meaningful numeric reward feedback on every non-terminal SQL exploration step, not just terminal correctness at ANSWER time.

	### How to Access/Test
	Run a normal episode (`reset` then `DESCRIBE`/`SAMPLE`/`QUERY`) and observe per-step `observation.reward` values changing with execution quality and answer progress.

	### Demo
	- Command: `uv run --with pytest pytest tests/test_smoke.py -v`
	- Proof points: DESCRIBE/SAMPLE rewards are `0.015`, invalid non-SELECT QUERY gets `-0.005`, QUERY returns positive dense reward, terminal budget-exhaustion still yields `0.0`.

	### Release Notes Snippet
	Dense 3-layer reward shaping is now fully integrated: all non-terminal actions emit numeric rewards, repeat/farming controls are enforced, progress-to-answer rewards are gated by improvement, and terminal correctness remains dominant.

	---

	## 11. PR Contract (Auto-Generated by autocode-next-step)

	<!-- This section is auto-populated by autocode-next-step command when all steps complete -->

	Status: Generated

	### Scope Delivered
	- Dense reward system implemented across `models.py`, `server/reward.py`, `server/sql_environment.py`, and test coverage updates in `tests/test_smoke.py` and `tests/unit/test_reward.py`.
	- Final non-terminal reward assertions now match shipped behavior and protect against regressions.

	### Verification Evidence
	- `uv run --with pytest pytest tests/test_smoke.py -v` -> 25 passed
	- `uv run --with pytest pytest tests/ -v` -> 166 passed, 1 skipped
	- Verifier subagent verdict: approved (high confidence, no critical findings)

	### Risks and Mitigations
	- Risk: Legacy callers infer terminality from `reward is None`.
	- Mitigation: Behavior spec now documents terminality contract based on `done`; smoke tests enforce non-terminal numeric rewards.
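
The risk above can be made concrete with a small sketch. `Observation` here is a hypothetical two-field stand-in, not the environment's real observation type; it only shows why a caller that infers terminality from `reward` breaks once non-terminal steps carry numeric rewards.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical minimal observation: only the two fields used here."""

    done: bool
    reward: Optional[float]


def is_terminal(obs):
    """Terminality contract after F003: rely on `done` only."""
    return obs.done


def is_terminal_legacy(obs):
    """Pre-F003 heuristic: a non-None reward implied the episode had ended."""
    return obs.reward is not None


# A successful non-terminal DESCRIBE step under dense rewards.
mid_episode = Observation(done=False, reward=0.015)
# Budget exhaustion: terminal, with reward 0.0.
out_of_budget = Observation(done=True, reward=0.0)
```

With dense rewards, `is_terminal_legacy(mid_episode)` wrongly reports the episode as finished, while `is_terminal` agrees with the `done` flag in both cases.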

	### Follow-up
	- Ready for commit/PR via `/commit-push-pr`.

	---

	## Stop Conditions (When to Split This Spec)

	Stop and create a new IMPLEMENTATION_SPEC if:
	- A step requires touching more than 3 files in unrelated areas
	- You need to introduce multiple new abstractions "just in case"
	- Verification cannot be made targeted and concrete
	- You discover new unknowns that change the plan materially
	- The next slice cannot be merged safely without finishing later slices

	When splitting, ensure the current slice ends in a merged, stable state.

	---

	## Human Checkpoint

	Before handing to AI agent:

	- [ ] Interface specifications are complete
	- [ ] Data flow is accurate
	- [ ] Error handling is specified
	- [ ] Implementation order makes sense
	- [ ] VERIFICATION_SPEC.md has been generated

	Questions:
	1. None

	---

	## Handoff Notes

	For the implementing AI agent:

	```
	Context: See RESEARCH_SUMMARY.md for system understanding
	Spec: Follow this document exactly
	Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
	Ambiguity: Stop and ask rather than assume
	Order: Follow implementation order exactly
	Key decisions already made:
	- Layer 2 weights: 0.25 cardinality, 0.50 value overlap, 0.25 numeric range (fixed)
	- gold_rows stored in EpisodeContext, populated at reset()
	- Progress bins: {0, 0.25, 0.5, 0.75, 1.0}
	- Clamping: [-0.2, +0.5] cumulative step reward
	- Pure Python only, no numpy/scipy
	```

	---

	Specification completed: 2026-03-27
	Verification input: specs/F003-VERIFICATION_INPUT.json
	Target agent: Claude Code