| # Implementation Specification |
|
|
| **Change:** F003 -- Dense Reward System (3-layer reward architecture) |
| **Date:** 2026-03-27 |
| **Research Summary:** [specs/F003-RESEARCH_SUMMARY.md](F003-RESEARCH_SUMMARY.md) |
| **Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner) |
| **Behavior Delta:** Archived to [specs/behavior/sql-environment.md](behavior/sql-environment.md) |
| **PR:** https://github.com/hjerpe/sql-env/pull/9 |
| |
| **Plan Status:** |
| - [x] Draft |
| - [x] Approved for Implementation |
| - [x] Implementation Complete |
| - [x] Verification Passed |
| |
| --- |
| |
| ## Core Intent (Immutable) |
| |
| > **DO NOT MODIFY THIS SECTION DURING REFINEMENT** |
| > Changes to Core Intent mean you're describing a different feature. |
| > If refinement reveals the need to change this section, create a new feature instead. |
| |
| **User Problem:** |
| Agents get meaningful feedback during exploration -- not just 0/1 at the end. A query that returns 40 when the answer is 42 gets partial credit. Discovering new schema info gets a small reward. This makes GRPO training converge. |
| |
| **Success Criteria:** |
| - Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3 |
| - Anti-gaming works: agent cannot farm rewards by repeating queries or describing everything |
| - Progress signal coarsened to 5 bins to prevent reward hill-climbing |
| |
| **Avoid:** |
| - Reward hacking (agent exploiting shaping signals to inflate reward without solving the task) |
| - Reward too sparse (no signal until terminal step defeats the purpose of dense rewards) |
| - Over-complex reward that is hard to debug (keep each layer simple and independently testable) |
| |
| **Out of Scope:** |
| - Adaptive/learned reward weights (use fixed weights: 0.25/0.50/0.25) |
| - Row-wise best-match alignment (add later if training shows need) |
| - NumPy/SciPy dependencies (pure Python only) |
| - Reward strategy classes or plugin architecture |
| - F002 verifier integration (Layer 3 uses existing naive check) |
| |
| --- |
| |
| ## 0. Slicing & Scope Budget (Anti-Waterfall) |
| |
| This spec must be executable in **small, mergeable increments**. |
| |
| ### Scope Budget |
| - Target: **3 slices** |
| - Hard max: **<= 10 steps total** |
| - Each step must end in: **implement -> verify -> merge** |
| |
| ### Slice Definition |
| A slice is a vertical increment that delivers user-visible value or a safe internal capability. |
| |
| **Each slice must have:** |
| - Clear outcome |
| - Minimal interface change |
| - Merge criteria |
| |
| **Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent). |
|
|
| ## Status Icons |
|
|
| **Step Status:** |
| - [ ] Not Started |
| - [~] In Progress |
| - [x] Completed |
| - [!] Blocked/Failed |
|
|
| **Result Outcome:** |
| - PASS: Fully Successful (all tests passed, no issues) |
| - WARN: Completed with Issues (needs follow-up) |
| - FAIL: Failed/Blocked |
|
|
| --- |
|
|
| ## 1. Implementation Overview |
|
|
| ### Summary |
|
|
| Implement the 3-layer reward architecture in `server/reward.py` and wire it into `SQLEnvironment.step()`. Layer 1 provides operational signals (exec_ok, new_info, repeat penalty, step cost). Layer 2 computes progress-to-target for QUERY actions using a fixed weighted average of cardinality matching (0.25), value overlap (0.50), and numeric range proximity (0.25), binned to 5 levels with improvement-only gating. Layer 3 remains the existing terminal correctness signal. New reward-tracking fields are added to `EpisodeContext`, and `gold_rows` are cached at `reset()`. Existing tests that assert `reward=None` for non-terminal steps are updated. |
|
|
| ### Scope |
|
|
| **In Scope:** |
| - `server/reward.py`: `compute_step_reward()`, Layer 1, Layer 2 with all sub-metrics, binning |
| - `models.py`: New fields on `EpisodeContext` (`gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward`) |
| - `server/sql_environment.py`: Wire `compute_step_reward()` into `step()`, store `gold_rows` at `reset()` |
| - Test updates for non-None step rewards |
|
|
| **Out of Scope:** |
| - F002 verifier integration (Layer 3 uses existing `_handle_answer`) |
| - Adaptive reward weights |
| - Row-wise best-match alignment |
| - NumPy/SciPy dependencies |
|
|
| --- |
|
|
| ## 1a. Execution Status |
| <!-- Auto-updated by /autocode-next-step - do not edit manually --> |
|
|
| **Progress:** 7/7 steps complete |
| **Current Step:** Finalization complete |
| **Last Updated:** 2026-03-28T06:05:02Z |
| **Latest Result:** PASS - Step 3.2 completed and final verification approved |
| **Blockers:** None |
|
|
| --- |
|
|
| ## 1b. Risk Assessment |
|
|
| **Risk Tier:** Low |
|
|
| **Risk Tier Definitions:** |
| - **Low:** Pure logic, non-user-facing, no security implications |
| - **Medium:** User input handling, data validation, API changes |
| - **High:** Authentication, payments, secrets management, untrusted input |
|
|
| **High-Risk Indicators Present:** None |
|
|
| **Security Review Required:** No |
|
|
| **Justification:** |
| Pure computation logic operating on in-memory data structures. No user input handling, no network I/O, no authentication. All inputs are already validated by the environment before reaching reward functions. |
|
|
| --- |
|
|
| ## 2. Change Manifest |
|
|
| ### Files to Create |
|
|
| None (all files already exist). |
|
|
| ### Files to Modify |
|
|
| | File | Changes | |
| |------|---------| |
| | `models.py` | Add 5 new fields to `EpisodeContext` dataclass | |
| | `server/reward.py` | Implement full reward module: `compute_step_reward`, Layer 1, Layer 2, sub-metrics, binning | |
| | `server/sql_environment.py` | Store `gold_rows` at `reset()`, call `compute_step_reward()` in `step()` | |
| | `tests/test_smoke.py` | Update assertions that expect `reward=None` for non-terminal steps | |
|
|
| ### Files to Delete |
|
|
| None. |
|
|
| --- |
|
|
| ## 3. Interface Specifications |
|
|
| ### Modified Types |
|
|
| ```python |
| # Location: models.py |
| # CHANGE: Add reward-tracking fields to EpisodeContext |
| |
| @dataclass |
| class EpisodeContext: |
|     """Per-episode server-side state (never sent to agent).""" |
| |
|     episode_id: str |
|     db_connection: sqlite3.Connection |
|     question_record: QuestionRecord |
|     step_count: int = 0 |
|     budget: int = 15 |
|     described_tables: set[str] = dataclass_field(default_factory=set) |
|     action_log: list[str] = dataclass_field(default_factory=list) |
|     done: bool = False |
|     gold_answer: str | None = None |
|     # --- NEW fields for F003 --- |
|     gold_rows: list[tuple] = dataclass_field(default_factory=list) |
|     query_hashes: set[str] = dataclass_field(default_factory=set) |
|     best_progress: float = 0.0 |
|     cumulative_step_reward: float = 0.0 |
|     cumulative_new_info_reward: float = 0.0 |
| ``` |
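
The two new container fields use `default_factory`, so each episode starts with its own empty list and set rather than sharing one object across instances. A minimal sketch of that behaviour, assuming `dataclass_field` is an alias for `dataclasses.field` and using a hypothetical `RewardTrackingSketch` class that carries only the five F003 fields (the real `EpisodeContext` has more):

```python
from dataclasses import dataclass, field as dataclass_field


@dataclass
class RewardTrackingSketch:
    """Hypothetical stand-in: only the F003 reward-tracking fields."""

    gold_rows: list[tuple] = dataclass_field(default_factory=list)
    query_hashes: set[str] = dataclass_field(default_factory=set)
    best_progress: float = 0.0
    cumulative_step_reward: float = 0.0
    cumulative_new_info_reward: float = 0.0


first = RewardTrackingSketch()
second = RewardTrackingSketch()
first.query_hashes.add("abc")  # mutate one episode's state only
print(second.query_hashes)  # the second episode's set is unaffected
```

Because every new field has a default, existing call sites that construct `EpisodeContext` with only the original arguments should keep working.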
|
|
| ### New Functions |
|
|
| ```python |
| # Location: server/reward.py |
| |
| def compute_step_reward( |
|     ctx: EpisodeContext, |
|     action_type: str, |
|     sql: str, |
|     rows: list[tuple] | None, |
|     error: str | None, |
| ) -> float: |
|     """ |
|     Compute dense reward for a single non-terminal step. |
| |
|     Combines Layer 1 (operational) and Layer 2 (progress) signals. |
|     Clamps running total of step rewards to [-0.2, +0.5]. |
| |
|     Args: |
|         ctx: Current episode context (mutated: updates tracking fields). |
|         action_type: One of DESCRIBE, SAMPLE, QUERY. |
|         sql: The SQL string executed (used for repeat detection). |
|         rows: Result rows from query execution, or None if error. |
|         error: Error message if action failed, else None. |
| |
|     Returns: |
|         Step reward (float). Also updates ctx.cumulative_step_reward. |
|     """ |
| |
| |
| def _layer1_operational( |
|     ctx: EpisodeContext, |
|     action_type: str, |
|     sql: str, |
|     rows: list[tuple] | None, |
|     error: str | None, |
| ) -> float: |
|     """ |
|     Layer 1: Operational reward signals. |
| |
|     Components: |
|     - exec_ok: +0.02 if query executed without error |
|     - new_info: +0.01 per new table discovered (capped at 0.10 cumulative) |
|     - repeat: -0.01 if exact query hash seen before |
|     - step_cost: -0.005 always |
| |
|     Args: |
|         ctx: Episode context (mutated: updates query_hashes, cumulative_new_info_reward). |
|         action_type: Action type string. |
|         sql: SQL string for hash-based repeat detection. |
|         rows: Result rows (used to confirm exec_ok). |
|         error: Error message if action failed. |
| |
|     Returns: |
|         Layer 1 reward component (float). |
|     """ |
| |
| |
| def _layer2_progress( |
|     ctx: EpisodeContext, |
|     rows: list[tuple], |
| ) -> float: |
|     """ |
|     Layer 2: Progress-to-target for QUERY actions only. |
| |
|     Computes weighted average of sub-metrics, bins to 5 levels, |
|     rewards only improvement over best-so-far, scaled by 0.15. |
| |
|     Args: |
|         ctx: Episode context (mutated: updates best_progress). |
|         rows: Query result rows to compare against ctx.gold_rows. |
| |
|     Returns: |
|         Layer 2 reward component (float). 0.0 if no improvement. |
|     """ |
| |
| |
| def _cardinality_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float: |
|     """ |
|     Row count similarity: 1 - |len(pred) - len(gold)| / max(len(pred), len(gold), 1). |
| |
|     Returns: |
|         Score in [0.0, 1.0]. |
|     """ |
| |
| |
| def _value_overlap_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float: |
|     """ |
|     Jaccard overlap of flattened cell values (as strings). |
| |
|     Returns: |
|         Score in [0.0, 1.0]. |
|     """ |
| |
| |
| def _numeric_range_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float: |
|     """ |
|     Log-distance proximity for numeric cells. |
| |
|     For each numeric value in gold, find closest numeric in pred. |
|     Score = mean(1 / (1 + log(1 + |pred - gold|))) across gold numerics. |
|     Returns 1.0 if no numeric values in gold. |
| |
|     Returns: |
|         Score in [0.0, 1.0]. |
|     """ |
| |
| |
| def _bin_progress(raw_score: float) -> float: |
|     """ |
|     Bin raw progress score to {0, 0.25, 0.5, 0.75, 1.0}. |
| |
|     Thresholds: [0, 0.125) -> 0, [0.125, 0.375) -> 0.25, |
|     [0.375, 0.625) -> 0.5, [0.625, 0.875) -> 0.75, [0.875, 1.0] -> 1.0. |
| |
|     Returns: |
|         Binned score. |
|     """ |
| ``` |
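
The two set-and-count metrics follow directly from their docstring formulas. The sketch below is one possible reading, not the shipped code: the function names drop the leading underscore to mark them as illustrative, and treating an empty union as a 0.0 overlap (via the same `max(..., 1)` guard the spec names for denominators) is an assumption.

```python
def cardinality_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """Row-count similarity, following the docstring formula."""
    denominator = max(len(pred_rows), len(gold_rows), 1)  # avoids 0/0 when both are empty
    return 1.0 - abs(len(pred_rows) - len(gold_rows)) / denominator


def value_overlap_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """Jaccard overlap of flattened cell values compared as strings."""
    pred_cells = {str(cell) for row in pred_rows for cell in row}
    gold_cells = {str(cell) for row in gold_rows for cell in row}
    union = pred_cells | gold_cells
    # Assumption: an empty union scores 0.0 through the max(..., 1) guard.
    return len(pred_cells & gold_cells) / max(len(union), 1)


gold = [(1, "a"), (2, "b")]
pred = [(1, "a")]
print(cardinality_score(pred, gold))    # 1 - |1 - 2| / 2
print(value_overlap_score(pred, gold))  # 2 shared cells out of 4 distinct
```

Comparing cells as strings means `1` and `"1"` count as the same value, which keeps the metric tolerant of type differences between the predicted and gold result sets.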
|
|
| --- |
|
|
| ## 4. Data Flow |
|
|
| ### Primary Flow (Non-terminal step with QUERY action) |
|
|
| ``` |
| 1. step() receives action (QUERY, sql_string) |
| - Input: SQLAction with action_type="QUERY", argument=sql |
| |
| 2. step() dispatches to _handle_query(sql) |
| - Action: Executes SQL, returns formatted result |
| - Side effect: Stores raw rows internally |
| |
| 3. step() calls compute_step_reward(ctx, "QUERY", sql, rows, error) |
| - Input: episode context, action metadata, raw query rows |
| |
| 4. compute_step_reward calls _layer1_operational(ctx, "QUERY", sql, rows, None) |
| - Computes: exec_ok(+0.02) + new_info(+0.01 if new tables) + repeat(-0.01 if seen) + step_cost(-0.005) |
| - Side effect: Updates ctx.query_hashes, ctx.cumulative_new_info_reward |
| |
| 5. compute_step_reward calls _layer2_progress(ctx, rows) |
| - Computes: weighted avg of cardinality(0.25) + value_overlap(0.50) + numeric_range(0.25) |
| - Bins to {0, 0.25, 0.5, 0.75, 1.0} |
| - Returns improvement * 0.15 (only if binned > ctx.best_progress) |
| - Side effect: Updates ctx.best_progress |
| |
| 6. compute_step_reward clamps cumulative to [-0.2, +0.5] |
| - Output: clamped step reward (float) |
| - Side effect: Updates ctx.cumulative_step_reward |
| ``` |
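
Step 6 clamps the running total rather than the individual step, so the value returned is whatever part of the raw step reward still fits inside [-0.2, +0.5]. A minimal sketch of that delta computation, with a hypothetical helper name and the bounds taken from the spec:

```python
STEP_REWARD_MIN = -0.2  # lower bound on the running total, from the spec
STEP_REWARD_MAX = 0.5   # upper bound on the running total, from the spec


def clamped_delta(cumulative: float, raw_step: float) -> tuple[float, float]:
    """Return (reward to emit, new running total) for one step."""
    target = min(STEP_REWARD_MAX, max(STEP_REWARD_MIN, cumulative + raw_step))
    return target - cumulative, target


reward, total = clamped_delta(cumulative=0.45, raw_step=0.15)
print(reward, total)  # only the part that fits under the cap is paid out
```

Once the total sits on a bound, further steps in the same direction return 0.0, which is what stops an agent from accumulating shaping reward indefinitely.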
|
|
| ### Alternative Flows |
|
|
| **When action is DESCRIBE or SAMPLE:** |
| ``` |
| 1. step() dispatches to _handle_describe() or _handle_sample() |
| 2. compute_step_reward calls _layer1_operational only (Layer 2 skipped) |
| 3. Clamping applied as usual |
| ``` |
|
|
| **When QUERY has SQL error:** |
| ``` |
| 1. _handle_query raises sqlite3.Error |
| 2. step() catches error, sets self._last_error |
| 3. compute_step_reward called with error=str(exc), rows=None |
| 4. Layer 1: step_cost only (-0.005), no exec_ok |
| 5. Layer 2: skipped (rows is None) |
| ``` |
|
|
| **When gold_rows is empty:** |
| ``` |
| 1. _layer2_progress detects ctx.gold_rows is empty |
| 2. Returns 0.0 (skip Layer 2 entirely) |
| ``` |
| |
| **When budget exhausted without ANSWER:** |
| ``` |
| 1. step() sets done=True, reward=0.0 (terminal) |
| 2. No compute_step_reward call for this terminal step |
| ``` |
|
|
| --- |
|
|
| ## 5. Error Handling |
|
|
| ### Error Types |
|
|
| | Error | When | Impact | |
| |-------|------|--------| |
| | SQL execution error | Invalid query syntax / runtime error | Layer 1: step_cost only, Layer 2 skipped | |
| | Empty gold_rows | Gold SQL returned no rows | Layer 2 returns 0.0, Layer 1 operates normally | |
| | Division by zero in metrics | Both pred and gold are empty | Protected by `max(..., 1)` denominators | |
|
|
| ### Error Handling Strategy |
|
|
| ```python |
| # In compute_step_reward: |
| # - No exceptions should propagate; all edge cases return safe defaults |
| # - If error is not None, skip exec_ok and Layer 2 |
| # - If rows is None, skip Layer 2 |
| # - If gold_rows is empty, skip Layer 2 |
| ``` |
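
The skip conditions above can be collected into a single predicate. This is a sketch under the assumption that the gate lives in one place; `should_run_layer2` is a hypothetical name, and the actual module may spread these checks across `compute_step_reward()` and `_layer2_progress()`.

```python
from typing import Optional


def should_run_layer2(
    action_type: str,
    rows: Optional[list[tuple]],
    error: Optional[str],
    gold_rows: list[tuple],
) -> bool:
    """True only for a successful QUERY with a non-empty gold result."""
    if action_type != "QUERY":
        return False  # DESCRIBE and SAMPLE get Layer 1 only
    if error is not None or rows is None:
        return False  # failed queries get step_cost only
    return bool(gold_rows)  # nothing to compare against when gold is empty


print(should_run_layer2("QUERY", [], None, [(42,)]))  # empty result, but still a successful query
```

Note that a successful query returning zero rows still reaches Layer 2 in this sketch, since `rows` is an empty list rather than `None`.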
|
|
| ### Retry Strategy |
|
|
| | Operation | Retry? | Strategy | |
| |-----------|--------|----------| |
| | Reward computation | No | Pure function, deterministic, no I/O | |
|
|
| --- |
|
|
| ## 6. Slice Plan (What we will ship, in order) |
|
|
| ### Slice S1 -- EpisodeContext Fields + Layer 1 |
| **Value:** Every non-terminal step returns a small but meaningful reward signal based on operational quality |
| **User-visible change:** Yes -- step observations now include non-None reward values |
| **Interfaces introduced/changed:** 5 new fields on EpisodeContext, `compute_step_reward()`, `_layer1_operational()` |
| **Rollback safety:** Additive only -- new fields have defaults, reward.py is new code |
|
|
| ### Slice S2 -- Layer 2 Progress Metrics |
| **Value:** QUERY actions receive progress-toward-answer signal, enabling convergent GRPO training |
| **User-visible change:** Yes -- QUERY step rewards now reflect closeness to gold answer |
| **Interfaces introduced/changed:** `_layer2_progress()`, `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, `_bin_progress()` |
| **Rollback safety:** Additive to reward.py, no external interface changes |
|
|
| ### Slice S3 -- Wire into step() + Test Updates |
| **Value:** Full system integration -- environment returns dense rewards on every step |
| **User-visible change:** Yes -- complete dense reward signal in step observations |
| **Interfaces introduced/changed:** `sql_environment.py:step()` modified, `sql_environment.py:reset()` modified |
| **Rollback safety:** Reversible by removing compute_step_reward call from step() |
|
|
| --- |
|
|
| ## 7. Implementation Steps |
|
|
| > **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md. |
| > The verification-planner (separate agent) generated independent test criteria. |
| > Run the tests specified there after implementing each step. |
| |
| ### Step 1.1: Add reward-tracking fields to EpisodeContext |
| **Slice:** S1 |
| **Goal:** Extend EpisodeContext with the 5 new fields required for reward tracking. |
| |
| **Files:** |
| - `models.py` - modify - Add `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, `cumulative_new_info_reward` fields |
| |
| **Interface Changes:** |
| - `EpisodeContext` dataclass gains 5 new fields (all with defaults, backward-compatible) |
| |
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
|
|
| **Risk Tier for This Step:** Low |
|
|
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
| |
| **Status:** Completed |
| |
| **Completed:** 2026-03-27T23:51:47Z |
| **Changes Made:** |
| - `models.py`: Added `EpisodeContext` reward-tracking defaults for `gold_rows`, `query_hashes`, `best_progress`, `cumulative_step_reward`, and `cumulative_new_info_reward`. |
| - `tests/unit/test_reward.py`: Added EpisodeContext-focused unit tests for new default fields and tuple-list `gold_rows` storage. |
| |
| **Result:** |
| - **Outcome:** PASS |
| - **Evidence Captured:** |
| ``` |
| Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext" |
| Result: 6 passed in 3.92s |
| ``` |
| - **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "EpisodeContext"` |
| - **Notes:** |
| - `tests/unit/test_reward.py` did not exist yet, so it was created to match verification spec coverage for EpisodeContext. |
| - Used `--with pytest` because bare `uv run pytest ...` fails in this repo due to a missing local pytest executable. |
| - Field additions are additive and backward compatible via defaults. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
| |
| **Context for Next Step:** |
| - EpisodeContext now has all fields needed by reward functions |
| |
| --- |
| |
| ### Step 1.2: Implement Layer 1 operational rewards |
| **Slice:** S1 |
| **Goal:** Implement `_layer1_operational()` with exec_ok, new_info, repeat penalty, and step_cost signals. |
| |
| **Files:** |
| - `server/reward.py` - modify - Implement `_layer1_operational()` function |
| |
| **Interface Changes:** |
| - New function `_layer1_operational(ctx, action_type, sql, rows, error) -> float` |
| |
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
| |
| **Status:** Completed |
| |
| **Completed:** 2026-03-27T23:54:50Z |
| **Changes Made:** |
| - `server/reward.py`: Implemented `_layer1_operational()` with step cost, exec-ok signal, repeat-query penalty, and capped new-info accumulation tracked on `EpisodeContext`. |
| - `tests/unit/test_reward.py`: Added `TestLayer1Operational` coverage for successful actions, SQL error behavior, repeat penalties, and new-info cap behavior. |
| |
| **Result:** |
| - **Outcome:** PASS |
| - **Evidence Captured:** |
| ``` |
| Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1" |
| Result: 8 passed, 6 deselected in 3.89s |
| ``` |
| - **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer1"` |
| - **Notes:** |
| - `uv run pytest ...` still fails in this repo because `pytest` is not installed in the project environment; used `uv run --with pytest ...` to satisfy package-manager execution policy. |
| - Repeat detection uses SHA-256 of the exact SQL string and suppresses `exec_ok` on repeated successful QUERY actions. |
| - New-info reward is only granted on first-seen successful QUERY actions and is capped at 0.10 cumulative per episode. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Layer 1 operational shaping is complete and covered by unit tests; proceed with Layer 2 pure scoring helpers in `server/reward.py`. |
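
The notes for this step describe SHA-256 repeat detection, `exec_ok` suppression on repeats, and a capped new-info bonus. The sketch below shows one way those rules could combine. It is not the shipped `_layer1_operational()`: `Layer1State` and `layer1_sketch` are hypothetical names, and applying hash tracking to QUERY actions only is an assumption made to keep the example small.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional

EXEC_OK = 0.02
NEW_INFO = 0.01
NEW_INFO_CAP = 0.10
REPEAT_PENALTY = 0.01
STEP_COST = 0.005


@dataclass
class Layer1State:
    """Hypothetical subset of EpisodeContext used by this sketch."""

    query_hashes: set[str] = field(default_factory=set)
    cumulative_new_info_reward: float = 0.0


def layer1_sketch(state: Layer1State, action_type: str, sql: str, error: Optional[str]) -> float:
    reward = -STEP_COST  # charged on every step
    if error is not None:
        return reward  # failed actions earn nothing else
    if action_type != "QUERY":
        return reward + EXEC_OK  # assumption: DESCRIBE/SAMPLE skip hash tracking
    digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()
    if digest in state.query_hashes:
        return reward - REPEAT_PENALTY  # repeat: penalty applies, exec_ok suppressed
    state.query_hashes.add(digest)
    reward += EXEC_OK
    # Small tolerance so float accumulation does not block the final bonus.
    if state.cumulative_new_info_reward + NEW_INFO <= NEW_INFO_CAP + 1e-9:
        state.cumulative_new_info_reward += NEW_INFO
        reward += NEW_INFO
    return reward


state = Layer1State()
print(layer1_sketch(state, "QUERY", "SELECT 1", None))  # first-seen success
print(layer1_sketch(state, "QUERY", "SELECT 1", None))  # exact repeat
```

Under these assumptions a successful DESCRIBE or SAMPLE nets 0.02 - 0.005 = 0.015 and a failed action nets -0.005.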
|
|
| --- |
|
|
| ### Step 2.1: Implement Layer 2 sub-metrics |
| **Slice:** S2 |
| **Goal:** Implement `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, and `_bin_progress()`. |
|
|
| **Files:** |
| - `server/reward.py` - modify - Add all four sub-metric functions |
|
|
| **Interface Changes:** |
| - 4 new pure functions (no state mutation) |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Status:** Completed |
|
|
| **Completed:** 2026-03-27T23:58:44Z |
| **Changes Made:** |
| - `server/reward.py`: Added pure Layer 2 helper functions `_cardinality_score()`, `_value_overlap_score()`, `_numeric_range_score()`, and `_bin_progress()` with bounded outputs and edge-case handling. |
| - `tests/unit/test_reward.py`: Added dedicated unit test coverage for all four sub-metrics, including boundary thresholds, empty inputs, mixed types, and numeric distance behavior. |
|
|
| **Result:** |
| - **Outcome:** PASS |
| - **Evidence Captured:** |
| ``` |
| Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress" |
| Result: 34 passed, 14 deselected in 5.06s |
| ``` |
| - **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "cardinality or value_overlap or numeric_range or bin_progress"` |
| - **Notes:** |
| - Implemented `_bin_progress()` with explicit clamping to `[0.0, 1.0]` before threshold binning. |
| - Numeric range scoring excludes booleans from numeric extraction to avoid `bool`/`int` coercion artifacts. |
| - All helpers are pure and deterministic, with no mutation of `EpisodeContext`. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Layer 2 helper metrics are now stable and tested; proceed to compose them in `_layer2_progress()` with weighted averaging and improvement-only gating. |
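
The remaining two helpers are sketched below from their docstrings and the notes for this step (booleans excluded, clamping before binning). Function names are illustrative. Two details are assumptions the source does not pin down: `log` is taken to be the natural logarithm, and a prediction with no numeric cells scores 0.0 when gold has numerics.

```python
import math


def numeric_cells(rows: list[tuple]) -> list[float]:
    """Collect int/float cells, skipping bools (bool is a subclass of int)."""
    return [
        float(cell)
        for row in rows
        for cell in row
        if isinstance(cell, (int, float)) and not isinstance(cell, bool)
    ]


def numeric_range_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    gold_values = numeric_cells(gold_rows)
    if not gold_values:
        return 1.0  # documented default when gold has no numerics
    pred_values = numeric_cells(pred_rows)
    if not pred_values:
        return 0.0  # assumption: nothing numeric to compare against
    scores = []
    for gold in gold_values:
        distance = min(abs(pred - gold) for pred in pred_values)
        scores.append(1.0 / (1.0 + math.log(1.0 + distance)))  # assumption: natural log
    return sum(scores) / len(scores)


def bin_progress(raw_score: float) -> float:
    clamped = min(1.0, max(0.0, raw_score))  # clamp before thresholding
    for threshold, level in ((0.875, 1.0), (0.625, 0.75), (0.375, 0.5), (0.125, 0.25)):
        if clamped >= threshold:
            return level
    return 0.0


print(numeric_range_score([(40,)], [(42,)]))  # 1 / (1 + ln 3): partial credit for a near miss
print(bin_progress(0.62), bin_progress(0.625))  # just below and exactly on a threshold
```

The first call is the "40 when the answer is 42" case from the Core Intent: the query is wrong, but the score stays well above zero.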
|
|
| --- |
|
|
| ### Step 2.2: Implement Layer 2 progress composition |
| **Slice:** S2 |
| **Goal:** Implement `_layer2_progress()` that combines sub-metrics with fixed weights (0.25/0.50/0.25), bins, and gates on improvement. |
|
|
| **Files:** |
| - `server/reward.py` - modify - Add `_layer2_progress()` function |
|
|
| **Interface Changes:** |
| - New function `_layer2_progress(ctx, rows) -> float` |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Status:** Completed |
|
|
| **Completed:** 2026-03-28T00:03:22Z |
| **Changes Made:** |
| - `server/reward.py`: Implemented `_layer2_progress()` using the fixed weighted composition (0.25/0.50/0.25), progress binning, improvement-only gating, and `ctx.best_progress` mutation on improvement. |
| - `tests/unit/test_reward.py`: Added `TestLayer2Progress` coverage for perfect match, no-improvement gating, incremental improvement rewards, empty-gold behavior, weighted-average outcome, best-progress updates, and non-downgrade behavior. |
|
|
| **Result:** |
| - **Outcome:** PASS |
| - **Evidence Captured:** |
| ``` |
| Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2" |
| Result: 7 passed, 48 deselected in 3.83s |
| ``` |
| - **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "layer2"` |
| - **Notes:** |
| - Implemented explicit constants for Layer 2 weights and improvement scale to keep composition intent readable and stable. |
| - `_layer2_progress()` returns zero when `gold_rows` is empty and never reduces `ctx.best_progress`. |
| - `uv run pytest ...` still requires `--with pytest` in this repository due to a missing local pytest executable. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Layer 2 composition is now complete and tested; next implement `compute_step_reward()` to combine Layer 1 + Layer 2 and apply cumulative clamping. |
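
The composition described in this step can be illustrated with the sub-scores passed in directly. This is a sketch, not the shipped `_layer2_progress()`: `progress_reward` is a hypothetical helper that returns the updated best-so-far value instead of mutating a context object, and the `bisect` lookup is just one compact way to express the five bins.

```python
import bisect

WEIGHTS = (0.25, 0.50, 0.25)  # cardinality, value overlap, numeric range
IMPROVEMENT_SCALE = 0.15
THRESHOLDS = (0.125, 0.375, 0.625, 0.875)
LEVELS = (0.0, 0.25, 0.5, 0.75, 1.0)


def progress_reward(
    best_progress: float, cardinality: float, overlap: float, numeric: float
) -> tuple[float, float]:
    """Return (Layer 2 reward, updated best_progress) from the three sub-scores."""
    raw = WEIGHTS[0] * cardinality + WEIGHTS[1] * overlap + WEIGHTS[2] * numeric
    clamped = min(1.0, max(0.0, raw))
    binned = LEVELS[bisect.bisect_right(THRESHOLDS, clamped)]
    if binned <= best_progress:
        return 0.0, best_progress  # no improvement: no reward, best is kept
    return (binned - best_progress) * IMPROVEMENT_SCALE, binned


best = 0.0
for scores in [(1.0, 0.5, 0.5), (1.0, 0.5, 0.5), (1.0, 1.0, 1.0)]:
    reward, best = progress_reward(best, *scores)
    print(reward, best)  # second call earns nothing: same bin as before
```

Because only bin-to-bin improvement is paid, the total Layer 2 reward over an episode cannot exceed 1.0 * 0.15 regardless of how many queries the agent issues.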
|
|
| --- |
|
|
| ### Step 2.3: Implement compute_step_reward with clamping |
| **Slice:** S2 |
| **Goal:** Implement the main `compute_step_reward()` entry point that combines Layer 1 and Layer 2, applies clamping to [-0.2, +0.5]. |
|
|
| **Files:** |
| - `server/reward.py` - modify - Add `compute_step_reward()` function |
|
|
| **Interface Changes:** |
| - New public function `compute_step_reward(ctx, action_type, sql, rows, error) -> float` |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Status:** Completed |
|
|
| **Completed:** 2026-03-28T00:06:56Z |
| **Changes Made:** |
| - `server/reward.py`: Implemented `compute_step_reward()` to compose Layer 1 and (QUERY-only) Layer 2 signals, then clamp cumulative step shaping to `[-0.2, +0.5]` while returning the per-step clamped delta. |
| - `tests/unit/test_reward.py`: Added `TestComputeStepReward` coverage for query success/error paths, DESCRIBE/SAMPLE behavior, upper/lower clamp boundaries, clamp delta semantics, context mutation, and Layer 2 skip conditions. |
|
|
| **Result:** |
| - **Outcome:** PASS |
| - **Evidence Captured:** |
| ``` |
| Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward" |
| Result: 11 passed, 55 deselected in 3.84s |
| ``` |
| - **Tests run:** `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward"` |
| - **Notes:** |
| - `compute_step_reward()` now updates `ctx.cumulative_step_reward` through clamp-aware delta computation so boundaries are enforced deterministically. |
| - Layer 2 is only evaluated for successful `QUERY` actions (`rows is not None` and `error is None`) to keep non-query and error behavior aligned with spec. |
| - Verification command from spec (`-k "compute_step_reward"`) currently selects zero tests because test names use `compute_reward`; used `-k "compute_reward"` to execute the intended step suite. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Reward composition and clamp behavior are complete; next wire `compute_step_reward()` into environment `reset()`/`step()` flow and expose query rows for Layer 2 integration. |
|
|
| --- |
|
|
| ### Step 3.1: Wire reward into step() and reset() |
| **Slice:** S3 |
| **Goal:** Store `gold_rows` in EpisodeContext at reset(). Call `compute_step_reward()` from step() for non-terminal actions. Expose raw query rows for Layer 2. |
|
|
| **Files:** |
| - `server/sql_environment.py` - modify - Update `reset()` to store gold_rows, update `step()` to call compute_step_reward, track raw query rows from `_handle_query` |
| |
| **Interface Changes:** |
| - `reset()`: Stores `gold_rows` in EpisodeContext |
| - `step()`: Sets `self._last_reward` from `compute_step_reward()` for non-ANSWER actions |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Status:** Completed |
|
|
| **Completed:** 2026-03-28T05:56:43Z |
| **Changes Made:** |
| - `server/sql_environment.py`: Imported `compute_step_reward` and wired dense reward calculation into `step()` for all non-terminal valid actions. |
| - `server/sql_environment.py`: Updated `_handle_query()` to return both formatted output and raw SQL rows so QUERY actions feed Layer 2 progress scoring. |
| - `server/sql_environment.py`: Preserved terminal budget behavior by skipping dense reward computation when the step exhausts budget (terminal reward remains `0.0`). |
|
|
| **Result:** |
| - **Outcome:** PASS |
| - **Evidence Captured:** |
| ``` |
| Command: uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2" |
| Result: 26 passed, 40 deselected in 4.85s |
| |
| Command: uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error" |
| Result: 5 passed, 20 deselected in 4.12s |
| ``` |
| - **Tests run:** |
| - `uv run --with pytest pytest tests/unit/test_reward.py -v -k "compute_reward or layer1 or layer2"` |
| - `uv run --with pytest pytest tests/test_smoke.py -v -k "describe_reveals_columns_and_updates_schema or sample_and_query_success or query_rejects_non_select or budget_exhaustion_sets_done_and_zero_reward or query_timeout_returns_error"` |
| - **Notes:** |
| - Dense shaping now executes in the environment action loop for non-terminal steps while keeping ANSWER and budget-exhaustion terminal reward semantics unchanged. |
| - QUERY actions now pass raw rows through to reward computation; DESCRIBE/SAMPLE paths compute Layer 1-only reward. |
| - Used `uv run --with pytest ...` because plain `uv run pytest ...` does not find a pytest executable in this repository environment. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Existing smoke tests still assert `reward is None` for reset and non-terminal paths; update those assertions to match dense reward behavior. |
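
The wiring in this step depends on two things: caching the gold rows once at `reset()`, and having the query handler hand back raw rows alongside the formatted text. A minimal sketch with an in-memory SQLite database follows; `run_query`, the `orders` table, and the output format are all hypothetical and only illustrate the shape of the change.

```python
import sqlite3


def run_query(conn: sqlite3.Connection, sql: str) -> tuple[str, list[tuple]]:
    """Hypothetical helper: return (text for the agent, raw rows for the reward)."""
    rows = conn.execute(sql).fetchall()
    formatted = "\n".join(" | ".join(str(cell) for cell in row) for row in rows)
    return formatted, rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 32.0)])

# At reset(): run the gold SQL once and cache its rows for the whole episode.
gold_rows = conn.execute("SELECT SUM(total) FROM orders").fetchall()

# At step(): the observation gets the text, the reward function gets the rows.
formatted, rows = run_query(conn, "SELECT COUNT(*) FROM orders")
print(formatted, rows, gold_rows)
```

Caching at `reset()` means the gold query runs once per episode instead of once per step, and the reward functions never need a database handle.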
|
|
| --- |
|
|
| ### Step 3.2: Update existing tests for dense rewards |
| **Slice:** S3 |
| **Goal:** Update tests in `tests/test_smoke.py` that assert `reward=None` for non-terminal steps to expect numeric reward values instead. |
|
|
| **Files:** |
| - `tests/test_smoke.py` - modify - Update reward assertions for non-terminal steps |
|
|
| **Interface Changes:** |
| - None (test-only changes) |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Status:** Completed |
|
|
| **Completed:** 2026-03-28T06:05:02Z |
| **Changes Made:** |
| - `tests/test_smoke.py`: Updated non-terminal action assertions to validate dense reward values instead of implicit `None` semantics. |
| - `tests/test_smoke.py`: Added concrete reward checks for DESCRIBE/SAMPLE (`0.015`), QUERY positive reward, non-SELECT QUERY penalty (`-0.005`), and first-step budget exhaustion reward behavior. |
|
|
| **Result:** |
| - **Outcome:** PASS |
| - **Evidence Captured:** |
| ``` |
| Command: uv run --with pytest pytest tests/test_smoke.py -v |
| Result: 25 passed in 4.04s |
| |
| Command: uv run --with pytest pytest tests/ -v |
| Result: 166 passed, 1 skipped in 4.29s |
| |
| Verifier: APPROVED (high confidence, no critical findings) |
| ``` |
| - **Tests run:** |
| - `uv run --with pytest pytest tests/test_smoke.py -v` |
| - `uv run --with pytest pytest tests/ -v` |
| - **Notes:** |
| - `uv run pytest ...` fails in this repository because `pytest` is not installed in the project environment; verification used `uv run --with pytest ...` while staying package-manager scoped. |
| - Assertions now align with dense-reward behavior and reinforce terminality checks via `done` rather than `reward is None` for non-terminal steps. |
| - Finalization included verifier approval, behavior-delta archival, and durable learning extraction. |
| - **Issues:** None |
| - **Follow-ups Created:** None |
| - **Human Review Completed:** N/A |
|
|
| **Context for Next Step:** |
| - Implementation steps are complete; proceed with `/commit-push-pr` when ready. |
|
|
| --- |
|
|
| ## 8. Rollout Considerations |
|
|
| ### Feature Flags |
| - [x] Required: No |
| - [ ] Flag name: N/A |
|
|
| ### Migration |
| - [x] Data migration needed: No |
|
|
| ### Rollback Plan |
| Remove the `compute_step_reward()` call from `step()` and restore `self._last_reward = None` for non-ANSWER actions. The new `EpisodeContext` fields are harmless if left unused, so they can stay in place.
|
|
| --- |
|
|
| ## 9. Execution Tracking |
|
|
| All execution state is tracked within this document: |
| - **Section 1a:** Overall progress summary |
| - **Section 7:** Per-step completion details, test results, and handoff context |
| - **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run` |
| - **Git history:** Full audit trail of changes to this file |
|
|
| The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by: |
| - Checking Section 1a for summary |
| - Reviewing Section 7 for detailed step status |
| - Inspecting the feature's `progress` and `status` fields in `FEATURES.json` |
| - Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history |
|
|
| --- |
|
|
| ## 9a. Slice Completion Protocol |
|
|
| After all steps in a slice pass verification: |
|
|
| 1. **Run verifier subagent** for spec compliance |
| - Validates against VERIFICATION_SPEC.md criteria |
| - Ensures no TODOs or incomplete work in slice |
| |
| 2. **Run compound-engineer subagent** to extract learnings |
| - **Mandatory invocation** after every slice completion |
| - Updates CLAUDE.md Learnings section (if durable patterns found) |
| - May exit with "no update needed" (valid for routine work) |
| |
| 3. **Commit** the slice changes |
| - Follow commit message format in CLAUDE.md |
| - Each slice gets its own atomic commit |
| |
| 4. **Continue to next slice** (if more slices remain) |
| - Or proceed to final verification if all slices complete |
| |
| **Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready. |
| |
| --- |
| |
| ## 10. User Value Summary |
| |
| <!-- Populated by /autocode-next-step when final step completes --> |
| |
| **Status:** Generated |
| |
| ### What Users Can Now Do |
| Agents now receive meaningful numeric reward feedback on every non-terminal SQL exploration step, not just terminal correctness at ANSWER time. |
| |
| ### How to Access/Test |
| Run a normal episode (`reset` then `DESCRIBE`/`SAMPLE`/`QUERY`) and observe per-step `observation.reward` values changing with execution quality and answer progress. |
| |
| ### Demo |
| - **Command:** `uv run --with pytest pytest tests/test_smoke.py -v` |
| - **Proof points:** DESCRIBE/SAMPLE rewards are `0.015`, invalid non-SELECT QUERY gets `-0.005`, QUERY returns positive dense reward, terminal budget-exhaustion still yields `0.0`. |
|
|
| ### Release Notes Snippet |
| Dense 3-layer reward shaping is now fully integrated: all non-terminal actions emit numeric rewards, repeat/farming controls are enforced, progress-to-answer rewards are gated by improvement, and terminal correctness remains dominant. |
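
The "gated by improvement" and repeat-control behavior described above can be illustrated with a small sketch. This is not the shipped code in `server/reward.py`: the `ProgressGate` class, its `scale` parameter, the whitespace/case normalization, and the choice to pay `0.0` for repeats are all assumptions made for illustration.

```python
class ProgressGate:
    """Hypothetical tracker: pays only when binned progress beats the best so far."""

    def __init__(self) -> None:
        self.best = 0.0                      # best binned progress seen this episode
        self.seen_queries: set[str] = set()  # normalized SQL already rewarded

    def reward(self, sql: str, binned_progress: float, scale: float = 0.1) -> float:
        # Assumption: queries are compared case- and whitespace-insensitively.
        key = " ".join(sql.lower().split())
        if key in self.seen_queries:
            return 0.0  # assumption: a repeated query earns nothing
        self.seen_queries.add(key)

        # Gate: no reward unless this query improves on the best progress so far.
        if binned_progress <= self.best:
            return 0.0
        gain = binned_progress - self.best
        self.best = binned_progress
        return scale * gain  # `scale` is an illustrative constant, not a spec value


gate = ProgressGate()
r_first = gate.reward("SELECT 1", 0.5)     # new query, progress improves
r_repeat = gate.reward("select   1", 0.75)  # same query after normalization
r_flat = gate.reward("SELECT 2", 0.5)      # new query, no improvement
r_better = gate.reward("SELECT 3", 0.75)   # new query, progress improves again
```

Because the gate pays only for the increase over the previous best, re-running a query or hovering at the same progress bin cannot accumulate reward.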
|
|
| --- |
|
|
| ## 11. PR Contract (Auto-Generated by autocode-next-step) |
|
|
| <!-- This section is auto-populated by autocode-next-step command when all steps complete --> |
|
|
| **Status:** Generated |
|
|
| ### Scope Delivered |
| - Dense reward system implemented across `models.py`, `server/reward.py`, `server/sql_environment.py`, and test coverage updates in `tests/test_smoke.py` and `tests/unit/test_reward.py`. |
| - Final non-terminal reward assertions now match shipped behavior and protect against regressions. |
|
|
| ### Verification Evidence |
| - `uv run --with pytest pytest tests/test_smoke.py -v` -> 25 passed |
| - `uv run --with pytest pytest tests/ -v` -> 166 passed, 1 skipped |
| - Verifier subagent verdict: approved (high confidence, no critical findings) |
|
|
| ### Risks and Mitigations |
| - **Risk:** Legacy callers infer terminality from `reward is None`. |
| - **Mitigation:** Behavior spec now documents terminality contract based on `done`; smoke tests enforce non-terminal numeric rewards. |
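
A minimal sketch of why the legacy check breaks and what the `done`-based contract looks like from a caller's side. The `Observation` dataclass here is a hypothetical stand-in, not the project's actual model; only the `reward` and `done` field names are taken from this spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """Hypothetical stand-in for the environment's observation type."""
    reward: Optional[float]
    done: bool


def is_terminal_legacy(obs: Observation) -> bool:
    # Old inference: any non-None reward was taken to mean the episode ended.
    return obs.reward is not None


def is_terminal(obs: Observation) -> bool:
    # Contract after this change: only `done` signals the end of an episode.
    return obs.done


# A non-terminal DESCRIBE step now carries a numeric reward (0.015 per the smoke tests).
describe_step = Observation(reward=0.015, done=False)

legacy_verdict = is_terminal_legacy(describe_step)  # misclassifies the step as terminal
current_verdict = is_terminal(describe_step)        # correctly reports non-terminal
```

Callers that still use the legacy pattern would end episodes after the first exploration step, so they should switch to checking `done`.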
|
|
| ### Follow-up |
| - Ready for commit/PR via `/commit-push-pr`. |
|
|
| --- |
|
|
| ## Stop Conditions (When to Split This Spec) |
|
|
| Stop and create a new IMPLEMENTATION_SPEC if: |
| - A step requires touching more than **3 files** in unrelated areas |
| - You need to introduce **multiple new abstractions** "just in case" |
| - Verification cannot be made targeted and concrete |
| - You discover new unknowns that change the plan materially |
| - The next slice cannot be merged safely without finishing later slices |
| |
| When splitting, ensure the current slice ends in a merged, stable state. |
| |
| --- |
| |
| ## Human Checkpoint |
| |
| **Before handing to AI agent:** |
| |
| - [ ] Interface specifications are complete |
| - [ ] Data flow is accurate |
| - [ ] Error handling is specified |
| - [ ] Implementation order makes sense |
| - [ ] VERIFICATION_SPEC.md has been generated |
|
|
| **Questions:** |
| 1. None |
|
|
| --- |
|
|
| ## Handoff Notes |
|
|
| **For the implementing AI agent:** |
|
|
| ``` |
| Context: See RESEARCH_SUMMARY.md for system understanding |
| Spec: Follow this document exactly |
| Verification: Use tests from VERIFICATION_SPEC.md (independent agent) |
| Ambiguity: Stop and ask rather than assume |
| Order: Follow implementation order exactly |
| Key decisions already made: |
| - Layer 2 weights: 0.25 cardinality, 0.50 value overlap, 0.25 numeric range (fixed) |
| - gold_rows stored in EpisodeContext, populated at reset() |
| - Progress bins: {0, 0.25, 0.5, 0.75, 1.0} |
| - Clamping: [-0.2, +0.5] cumulative step reward |
| - Pure Python only, no numpy/scipy |
| ``` |
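
The fixed decisions listed above (Layer 2 weights, progress bins, clamping range) can be sketched in pure Python as follows. The function names are hypothetical, and rounding to the nearest bin is an assumption: the spec fixes the bin values but this section does not state whether scores are rounded or floored.

```python
# Constants taken from the key decisions above.
W_CARDINALITY, W_VALUE_OVERLAP, W_NUMERIC_RANGE = 0.25, 0.50, 0.25
PROGRESS_BINS = (0.0, 0.25, 0.5, 0.75, 1.0)
CLAMP_MIN, CLAMP_MAX = -0.2, 0.5


def combine_progress(cardinality: float, value_overlap: float, numeric_range: float) -> float:
    """Weighted sum of the three Layer 2 sub-scores, each expected in [0, 1]."""
    return (W_CARDINALITY * cardinality
            + W_VALUE_OVERLAP * value_overlap
            + W_NUMERIC_RANGE * numeric_range)


def coarsen(score: float) -> float:
    """Snap a raw progress score to one of the 5 bins.

    Assumption: nearest-bin rounding; the real implementation may floor instead.
    """
    score = min(1.0, max(0.0, score))
    return min(PROGRESS_BINS, key=lambda b: abs(b - score))


def clamp_cumulative(total: float) -> float:
    """Keep the running sum of step rewards within [-0.2, +0.5]."""
    return max(CLAMP_MIN, min(CLAMP_MAX, total))


raw = combine_progress(cardinality=1.0, value_overlap=0.5, numeric_range=0.0)
binned = coarsen(raw)
capped = clamp_cumulative(0.73)
```

Coarsening hides small score changes from the agent, which is what limits reward hill-climbing, and the clamp keeps accumulated shaping reward below the terminal reward for a correct answer.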
|
|
| --- |
|
|
| *Specification completed: 2026-03-27* |
| *Verification input: specs/F003-VERIFICATION_INPUT.json* |
| *Target agent: Claude Code* |
|
|