# SQLEnv: Interactive Database Query Environment (v1 Spec)

## OpenEnv Challenge Submission Design & MVP Development Plan

---

## 1. What We're Building & Delivering

### 1.1 One-Sentence Summary

SQLEnv is an RL environment where an agent answers natural language questions about databases through iterative SQL exploration (schema inspection, exploratory queries, result observation, and refinement) before submitting a final answer.

### 1.2 Why It's Novel

To our knowledge, no RL environment for interactive SQL exists. Text-to-SQL benchmarks (Spider, BIRD, WikiSQL) are static single-shot evaluations. SQLEnv transforms the task into a multi-turn exploration problem where agents develop query strategies through trial, error, and feedback. This maps directly to how real data analysts work.

### 1.3 Submission Artifacts (Mapped to Evaluation Criteria)

| Evaluation Criterion | Artifact | Description |
|---|---|---|
| **Creative and Robust use of OpenEnv** | SQLEnv environment on HF Hub | Full environment: models, server, client, Dockerfile, `openenv.yaml` |
| **Technical Excellence** | 3-layer reward architecture + multi-type answer verification | Dense stepwise reward (not just terminal), typed answer comparison |
| **Story-telling** | HuggingFace blog post | "Teaching AI to think like a data analyst": untrained vs. trained agent |
| **Open Source Demo** | Training notebooks/scripts on GitHub | GRPO training script (TRL-compatible), baseline comparison |
| **Green Agent wrapper** | Green Agent class | Automated evaluation wrapper following the OpenEnv pattern |

### 1.4 What Exactly Gets Submitted

1. **HF Hub Space**: Docker container running the SQLEnv server (WebSocket API)
2. **GitHub repo**: Environment source + GRPO training notebook + results
3. **HF blog post**: Narrative + learning curves + side-by-side demo (untrained vs. trained)

---

## 2. Environment Design

### 2.1 OpenEnv Integration Architecture

```
envs/sql_env/
├── __init__.py
├── models.py            # SQLAction, SQLObservation (Pydantic)
├── client.py            # SQLEnv(EnvClient): WebSocket client
├── openenv.yaml         # Environment manifest
├── pyproject.toml
├── uv.lock
├── data/                # SQLite databases + question sets
│   ├── databases/       # Spider DB files
│   └── questions/       # JSON question sets with gold answers
└── server/
    ├── __init__.py
    ├── app.py           # create_app(SQLEnvironment, SQLAction, SQLObservation)
    ├── environment.py   # SQLEnvironment(Environment): core logic
    ├── reward.py        # Reward computation (3 layers)
    ├── verifier.py      # Answer comparison (multi-type)
    ├── requirements.txt
    └── Dockerfile
```

### 2.2 Pydantic Models

```python
# models.py
from pydantic import Field
from openenv.core.env_server.types import Action, Observation, State

class SQLAction(Action):
    """What the agent sends each step."""
    action_type: str = Field(
        ...,
        description="One of: DESCRIBE, SAMPLE, QUERY, ANSWER"
    )
    argument: str = Field(
        ...,
        description="Table name (for DESCRIBE/SAMPLE), SQL string (for QUERY), or answer value (for ANSWER)"
    )

class SQLObservation(Observation):
    """What the agent receives after each step."""
    # Inherited: done (bool), reward (float | None)
    question: str = Field(..., description="The NL question to answer")
    schema_info: str = Field(..., description="Database schema description")
    result: str = Field(default="", description="Result of the last action (truncated)")
    error: str = Field(default="", description="Error message if action failed")
    step_count: int = Field(default=0, description="Current step number")
    budget_remaining: int = Field(default=0, description="Steps left before timeout")
    action_history: list[str] = Field(
        default_factory=list,
        description="Summary of previous actions taken"
    )
```

**Design note**: `result` is a string, not raw data. Results are always truncated/summarized (max N rows as formatted text). This is intentional: the agent sees what a real analyst would see, not the full database. This makes the environment a POMDP, which is appropriate for the task and beneficial for learning dynamics (see Section 3).

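As a concrete illustration, here is a minimal formatting sketch; the helper name `format_result` and the exact layout are our own assumptions, not part of the spec:

```python
def format_result(columns: list[str], rows: list[tuple], max_rows: int = 20) -> str:
    """Render query results as the truncated text an analyst would see."""
    shown = rows[:max_rows]
    header = " | ".join(columns)
    body = "\n".join(" | ".join(str(v) for v in row) for row in shown)
    footer = f"\n... ({len(rows) - max_rows} more rows truncated)" if len(rows) > max_rows else ""
    return f"{header}\n{body}{footer}"
```
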
### 2.3 State

Uses the core `State` class from OpenEnv (`episode_id` + `step_count`). No custom state is needed for the MVP.

### 2.4 Action Space

| Action | Argument | Effect | Cost |
|---|---|---|---|
| `DESCRIBE` | table_name | Returns column names, types, row count | 1 step |
| `SAMPLE` | table_name | Returns 5 random rows (formatted text) | 1 step |
| `QUERY` | sql_string | Executes SQL, returns truncated results (max 20 rows) | 1 step |
| `ANSWER` | value | Submits final answer, ends episode | 0 steps (terminal) |

**Step budget**: 15 steps per episode (configurable). This is enough for 2-3 exploration actions, 3-5 query attempts, and an answer, and it keeps episodes short enough for efficient training.

**Query sandboxing**: All SQL runs in a read-only SQLite connection with a 5-second statement timeout. Only SELECT statements are allowed: no writes, no DDL, no pragmas.

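A minimal sketch of what this sandboxing could look like with Python's `sqlite3` module (helper names are ours, and the naive SELECT prefix check stands in for a proper statement whitelist):

```python
import sqlite3
import time

MAX_QUERY_SECONDS = 5  # statement timeout from the spec above

def open_sandboxed(db_path: str) -> sqlite3.Connection:
    """Read-only connection: mode=ro rejects writes at the file level,
    and query_only additionally blocks writes through this connection."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    conn.execute("PRAGMA query_only = ON")
    return conn

def run_select(conn: sqlite3.Connection, sql: str, max_rows: int = 20) -> list[tuple]:
    """Execute a single SELECT with a wall-clock timeout and a row cap."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")
    deadline = time.monotonic() + MAX_QUERY_SECONDS
    # The progress handler fires every N SQLite VM instructions; returning a
    # truthy value aborts the statement with sqlite3.OperationalError.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 10_000)
    try:
        return conn.execute(sql).fetchmany(max_rows)
    finally:
        conn.set_progress_handler(None, 0)
```
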
### 2.5 Episode Lifecycle

```
┌────────────────────────────────────────────────────────────┐
│ reset()                                                    │
│   → Pick random question from question set                 │
│   → Load corresponding SQLite database (read-only)         │
│   → Return initial observation:                            │
│       question, schema_info (table names only), budget=15  │
└─────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ step(action) loop                                          │
│                                                            │
│   DESCRIBE → return columns/types for requested table      │
│   SAMPLE   → return 5 random rows from requested table     │
│   QUERY    → execute SQL, return truncated result or error │
│   ANSWER   → compare to gold, compute terminal reward,     │
│              done=True                                     │
│                                                            │
│   Each non-ANSWER step: compute stepwise reward, decrement │
│   budget. If budget=0 and no ANSWER: done=True, reward=0   │
└────────────────────────────────────────────────────────────┘
```

**Important**: On `reset()`, the agent sees only **table names** in `schema_info`, not full column details. The agent must actively DESCRIBE tables to learn the schema. This is a deliberate design choice: it creates an exploration incentive and mirrors real-world "I have a database, what's in it?" workflows.

---

## 3. Reward Architecture

The reward research (SQL-TRAIL, PaVeRL-SQL, QueryGym, Graph-Reward-SQL) converges on one principle: **reward improvement during exploration, but make correctness dominate at termination.**

Our reward has 3 layers. The total episode reward is:

```
R_episode = R_terminal + clamp(sum(R_step_i), -0.2, 0.5)
```

(The clamp on the summed step rewards, including the -0.2 floor, is detailed in Section 3.4.)

### 3.1 Layer 1: Operational Validity (every step, no gold reference needed)

These signals reward "being able to operate the tool" without referencing the gold answer.

| Signal | Value | When |
|---|---|---|
| `r_exec_ok` | +0.02 | A QUERY executes without error |
| `r_new_info` | +0.01 | Action reveals new info (new table described, new column seen). Capped at 0.10 per episode |
| `r_repeat` | -0.01 | Exact same SQL run again (hash match), or exact same DESCRIBE/SAMPLE repeated |
| `r_cost` | -0.005 | Every step (small constant, keeps trajectories short) |

**Design rationale**: Prefer small positive signals plus "no reward" over heavy negative penalties. Strong negatives make agents risk-averse and destabilize training (a failure mode noted in PaVeRL-SQL). The step cost is tiny: just enough to prefer shorter trajectories when everything else is equal.

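A minimal sketch of how these signals could be computed per step (function and variable names are ours; the caller passes `new_items`, the number of newly revealed tables/columns for this action, and the running `new_info_total` enforces the per-episode cap):

```python
import hashlib

def layer1_reward(action_type: str, argument: str, executed_ok: bool,
                  seen_actions: set[str], new_items: int,
                  new_info_total: float) -> tuple[float, float]:
    """Returns (step reward, updated new_info_total). Illustrative sketch."""
    r = -0.005  # r_cost: small constant per step

    # r_repeat: hash-based detection of exact repeats (SQL or DESCRIBE/SAMPLE)
    key = hashlib.sha256(f"{action_type}:{argument.strip().lower()}".encode()).hexdigest()
    if key in seen_actions:
        r -= 0.01
    seen_actions.add(key)

    # r_exec_ok: the query ran without error
    if action_type == "QUERY" and executed_ok:
        r += 0.02

    # r_new_info: +0.01 per newly revealed table/column, capped at 0.10/episode
    grant = min(0.01 * new_items, 0.10 - new_info_total)
    if grant > 0:
        r += grant
        new_info_total += grant
    return r, new_info_total
```
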
### 3.2 Layer 2: Progress-to-Target (QUERY steps only, oracle-based but coarsened)

After each QUERY, we compute how close the query result is to the gold answer, then reward only *improvement* over the best-so-far.

**Progress metric** (depends on answer type):

| Answer Type | Progress Computation |
|---|---|
| **Integer/Count** | `1 - min(1, abs(pred - gold) / max(1, abs(gold)))` |
| **Float/Average** | Same as integer, but with tolerance (within 1% = 1.0) |
| **String/Name** | Exact match = 1.0, else 0.0 (too fragile to partially score) |
| **List/Set** | `Jaccard(pred_set, gold_set)`, i.e., set overlap |
| **Table** | `0.5 * column_overlap + 0.5 * row_sample_overlap` |

**Coarsening**: To prevent reward leakage (the agent hill-climbing on the reward signal instead of reasoning), we bin progress into 5 levels: {0, 0.25, 0.5, 0.75, 1.0}.

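For concreteness, a sketch of the raw progress metric before binning (the helper name `progress_score` is ours and the table-overlap case is omitted for brevity); its output is the `progress_raw` consumed by the binning step below:

```python
def progress_score(pred, gold, answer_type: str) -> float:
    """Raw progress in [0, 1], computed per the table above. Sketch only."""
    if answer_type == "integer":
        return 1 - min(1, abs(int(pred) - int(gold)) / max(1, abs(int(gold))))
    if answer_type == "float":
        rel_err = abs(float(pred) - float(gold)) / max(1, abs(float(gold)))
        return 1.0 if rel_err <= 0.01 else 1 - min(1, rel_err)
    if answer_type == "string":
        return 1.0 if str(pred).strip().lower() == str(gold).strip().lower() else 0.0
    if answer_type == "list":
        p, g = set(pred), set(gold)
        return len(p & g) / len(p | g) if (p | g) else 1.0
    return 0.0
```
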
**Improvement-only reward**:

```python
def bin_to_nearest(bins: list[float], x: float) -> float:
    """Snap raw progress to the nearest bin level."""
    return min(bins, key=lambda b: abs(b - x))

progress_binned = bin_to_nearest([0, 0.25, 0.5, 0.75, 1.0], progress_raw)
r_progress = max(0, progress_binned - best_progress_so_far) * 0.15
best_progress_so_far = max(best_progress_so_far, progress_binned)
```

This prevents agents from "farming" reward by oscillating between good and bad queries.

### 3.3 Layer 3: Terminal Correctness (ANSWER action only)

| Condition | Reward |
|---|---|
| Answer matches gold (type-appropriate comparison) | +1.0 |
| Answer does not match | 0.0 |
| Episode times out (budget exhausted, no ANSWER) | 0.0 |

Terminal correctness is always the dominant signal. With the cap on stepwise rewards (0.5 max), a correct answer is always worth at least 2x the maximum exploration reward.

### 3.4 Total Reward Formula

```
R_episode = 1.0 * R_correct + clamp(sum(r_exec_ok + r_new_info + r_repeat + r_cost + r_progress), -0.2, 0.5)
```

The negative clamp at -0.2 prevents catastrophically negative episodes from destabilizing training.

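For reference, the same formula as a small Python helper (a sketch; `r_correct` is the 1.0/0.0 outcome from Layer 3, and `step_rewards` is the per-step list from Layers 1 and 2):

```python
def episode_reward(r_correct: float, step_rewards: list[float]) -> float:
    """Total episode reward: terminal correctness plus clamped shaping."""
    shaped = max(-0.2, min(0.5, sum(step_rewards)))
    return 1.0 * r_correct + shaped
```
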
### 3.5 How This Maps to TRL's GRPOTrainer

Following the Wordle GRPO tutorial pattern, we expose reward components as separate `reward_funcs`:

```python
# In the training script
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model_name,
    processing_class=tokenizer,
    reward_funcs=[
        reward_correctness,   # Terminal: 0.0 or 1.0
        reward_progress,      # Cumulative progress improvement (0 to 0.15)
        reward_operational,   # Sum of exec_ok, new_info, repeat, cost signals
    ],
    train_dataset=dataset,
    args=grpo_config,
    rollout_func=rollout_func,
)
```

The `rollout_func` runs a full episode against the SQLEnv server (via WebSocket), collects the observation stream, and computes per-component rewards. This lets TRL handle the weighting/normalization.

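The exact `rollout_func` contract depends on the TRL version, so we sketch only the episode-playing core it would wrap. Everything here is an assumption: `policy(obs) -> SQLAction` is a placeholder, and splitting the scalar step reward into named components would require the server to expose a breakdown:

```python
def play_episode(env, policy) -> dict:
    """Play one SQLEnv episode and collect rewards for TRL's reward_funcs.

    Sketch only: client methods follow Section 2 (reset/step returning a
    result with .observation, .done, .reward).
    """
    result = env.reset()
    obs = result.observation
    shaped_total = 0.0  # Layers 1 + 2, accumulated over non-terminal steps
    terminal = 0.0      # Layer 3 correctness, reported on the final step

    while not result.done:
        action = policy(obs)
        result = env.step(action)
        obs = result.observation
        if result.done:
            terminal = result.reward or 0.0
        else:
            shaped_total += result.reward or 0.0

    return {"reward_correctness": terminal, "reward_shaped": shaped_total}
```
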
### 3.6 Anti-Gaming Measures

| Risk | Mitigation |
|---|---|
| Agent hill-climbs on progress signal to infer gold answer | Coarsen progress to 5 bins; step rewards are small vs. terminal |
| Agent DESCRIBEs everything to harvest `r_new_info` | Cap `r_new_info` at 0.10 per episode |
| Agent submits random answers hoping for partial credit | No partial credit on ANSWER: binary correctness only |
| Agent runs many queries to accumulate `r_exec_ok` | Step cost + budget limit make this net-negative after ~10 queries |
| Agent repeats identical queries | Hash-based repeat detection, penalty applied |

---

## 4. Question Sourcing & Verification

### 4.1 MVP: Spider Subset

**Why Spider**: Most-studied text-to-SQL benchmark, gold SQL available for all questions, existing tooling for test suite generation, clean SQLite databases included, well-understood difficulty levels.

**MVP question set**: 50-100 questions from Spider's dev set, selected to cover:
- Simple (SELECT/WHERE/COUNT): ~40%
- Medium (JOIN/GROUP BY): ~40%
- Hard (subqueries/HAVING/multi-step): ~20%

**Answer types to support in MVP**:
- Integer (counts, sums)
- Float (averages, with 1% tolerance)
- String (single-value lookups)
- List (top-k results, order-insensitive set comparison)

### 4.2 Answer Verification

```python
def verify_answer(predicted, gold, answer_type: str) -> bool:
    match answer_type:
        case "integer":
            return int(predicted) == int(gold)
        case "float":
            # Relative tolerance of 1%, with a floor of 1 to avoid dividing by ~0
            return abs(float(predicted) - float(gold)) / max(1, abs(float(gold))) < 0.01
        case "string":
            return str(predicted).strip().lower() == str(gold).strip().lower()
        case "list":
            # Order-insensitive comparison; normalize() lives in verifier.py
            return set(normalize(predicted)) == set(normalize(gold))
        case "table":
            return compare_tables(predicted, gold)  # Column + row overlap
        case _:
            raise ValueError(f"Unknown answer type: {answer_type}")
```

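The `normalize` helper is referenced but not defined above; one plausible sketch (ours, not the spec's) for order-insensitive list comparison:

```python
def normalize(values) -> list[str]:
    """Canonicalize a list answer: split comma-separated strings,
    strip whitespace, lowercase. Illustrative sketch."""
    if isinstance(values, str):
        values = values.split(",")
    return [str(v).strip().lower() for v in values]
```
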
### 4.3 Question Metadata Format

Each question in the JSON dataset includes:

```json
{
  "id": "spider_dev_042",
  "question": "How many employees are in the Engineering department?",
  "database": "company_db",
  "gold_sql": "SELECT COUNT(*) FROM employees WHERE department = 'Engineering'",
  "gold_answer": "42",
  "answer_type": "integer",
  "difficulty": "easy",
  "tables_involved": ["employees"]
}
```

The `gold_sql` is used for progress computation (running the gold query on the DB to get the reference answer). The `gold_answer` is the cached expected result. Both are hidden from the agent.

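A short sketch of how the server might load a question and compute the reference result from `gold_sql` (the directory layout and `.sqlite` extension are illustrative assumptions):

```python
import sqlite3

def load_reference(question: dict, db_dir: str) -> list[tuple]:
    """Run the hidden gold SQL on the question's database (read-only)
    to obtain the reference rows used by the Layer 2 progress metric."""
    db_path = f"{db_dir}/{question['database']}.sqlite"
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(question["gold_sql"]).fetchall()
    finally:
        conn.close()
```
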
### 4.4 Multi-DB Verification (Post-MVP)

**MVP**: Single database per question. Accept a correct result on the original DB.

**Post-MVP**: For each question, generate 1-2 variant databases with the same schema but different data. An answer must be correct across all variants.

Variant generation strategies (prioritized):
1. **Irrelevant row injection**: add records outside the question's filter scope
2. **Join multiplicity trap**: add duplicates in bridge tables
3. **ID remap**: apply a bijection to primary keys, update foreign keys

(See Section 6.2 for the full metamorphic testing backlog.)

---

## 5. MVP Development Track

The goal is to get a **working, submittable solution as fast as possible**, then improve iteratively. Each phase has a "done when" gate. Do not start the next phase until the current one passes its gate.

### Phase 1: Scaffold & Stub (Day 1)

**What**: Run `openenv init sql_env`, customize the generated models/server/client stubs to match our Pydantic models. Get a Docker container that starts and responds to reset/step.

**Tasks**:
1. `openenv init sql_env`
2. Replace generated models with `SQLAction`, `SQLObservation` from Section 2.2
3. Implement stub `SQLEnvironment.reset()`: returns a hardcoded observation
4. Implement stub `SQLEnvironment.step()`: accepts an action, returns a hardcoded observation
5. Implement the `SQLEnv(EnvClient)` client with `_step_payload`, `_parse_result`, `_parse_state`
6. `openenv build` and `openenv validate`

**Done when**: `openenv validate --verbose` passes. Client can connect, reset, step, and receive typed observations.

### Phase 2: Core Loop with Terminal Reward (Days 2-4)

**What**: Wire up real SQLite databases, implement the action handlers, add terminal-only reward (binary correctness on ANSWER). This is a **submittable environment**: sparse reward, but functional.

**Tasks**:
1. Download Spider dev databases (SQLite files) and a curated question set (30-50 questions)
2. Implement `reset()`: pick a random question, load the DB, return the initial observation with table names
3. Implement the DESCRIBE handler: query `sqlite_master` + `PRAGMA table_info` (see the sketch after this list)
4. Implement the SAMPLE handler: `SELECT * FROM {table} ORDER BY RANDOM() LIMIT 5`, validating the table name against the schema first
5. Implement the QUERY handler: execute read-only SQL with a timeout, truncate results to 20 rows
6. Implement the ANSWER handler: compare to the gold answer using `verify_answer()`, return the reward
7. Add budget tracking (15 steps) and timeout handling
8. SQL sandboxing: read-only connection, statement timeout, no DDL

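A sketch of the DESCRIBE handler from task 3 (the helper name is ours; validating the table name first also protects the interpolated `{table}` in the SAMPLE handler):

```python
import sqlite3

def describe_table(conn: sqlite3.Connection, table: str) -> str:
    """Columns, types, and row count for one table. Illustrative sketch."""
    # Validate against sqlite_master so the interpolated name can't inject SQL.
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in tables:
        return f"Unknown table: {table}"
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    n_rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    col_lines = [f"  {name} ({ctype or 'ANY'})" for _, name, ctype, *_ in cols]
    return f"Table {table} ({n_rows} rows):\n" + "\n".join(col_lines)
```
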
**Done when**: Can run a full episode manually: reset, DESCRIBE a table, run a query, submit an answer, get a correct/incorrect reward. Works in Docker via WebSocket.

### Phase 3: Dense Reward (Days 5-7)

**What**: Add reward Layers 1 and 2 (operational validity + progress-to-target). This makes the environment trainable: agents get feedback before the terminal step.

**Tasks**:
1. Implement `reward.py` with the 3-layer reward computation
2. Layer 1: Track executed queries (hash set for repeat detection), track schema exploration (set of described tables/columns), compute `r_exec_ok`, `r_new_info`, `r_repeat`, `r_cost`
3. Layer 2: After each QUERY, run the gold SQL on the DB, compare the query result to the gold result using the type-dependent progress metric, bin to 5 levels, compute the improvement-only reward
4. Wire reward computation into `step()`: return the stepwise reward in the observation
5. Add reward capping logic (sum of step rewards <= 0.5, negative floor at -0.2)
6. Test: run episodes, verify the reward signals are sensible

**Done when**: The reward signal varies meaningfully across different agent behaviors (random exploration gives some small positive reward; targeted queries give progress reward; a correct answer gives terminal reward).

### Phase 4: Training Pipeline (Days 8-12)

**What**: Implement the GRPO training script following the Wordle tutorial pattern. Train a small model. Produce a baseline vs. trained comparison.

**Tasks**:
1. Write a `rollout_func` that plays full SQLEnv episodes via the WebSocket client
2. Design the system prompt for the SQL agent (schema understanding, query strategy, answer formatting)
3. Implement reward functions for TRL (`reward_correctness`, `reward_progress`, `reward_operational`)
4. Set up the training config (Qwen3-1.7B or a similar small model, GRPO via TRL+vLLM)
5. Run training (start small: 100 episodes, observe learning curves)
6. Implement the Green Agent wrapper (automated evaluation: run N episodes, report success rate)
7. Produce a comparison: random policy vs. trained model (success rate, avg steps, avg reward)
8. Debug and iterate on reward weights if training doesn't converge

**Done when**: The trained model measurably outperforms the random baseline on success rate. The training notebook runs end-to-end. The Green Agent reports evaluation metrics.

### Phase 5: Polish & Submit (Days 13-16)

**What**: Deploy to HF, write the blog post, prepare the demo.

**Tasks**:
1. `openenv push` to HuggingFace Spaces
2. Clean up the GitHub repo: README, requirements, training notebook, results
3. Write the HF blog post:
   - Hook: "What if AI could learn to query databases like a data analyst?"
   - Problem: Static benchmarks don't teach exploration strategy
   - Solution: SQLEnv, an interactive environment with dense reward
   - Results: Learning curves, before/after comparison, example episodes
   - Technical: Reward architecture, OpenEnv integration
4. Record/screenshot a side-by-side demo (untrained vs. trained agent)
5. Final validation: someone else can `pip install`, connect to the HF Space, and run the training notebook

**Done when**: All 3 submission artifacts are live (HF Space, GitHub repo, HF blog). The blog tells a compelling story with real results.

### Phase Summary

| Phase | Days | Produces | Risk if Skipped |
|---|---|---|---|
| 1. Scaffold | 1 | Stub environment in Docker | Can't build anything |
| 2. Core Loop | 2-4 | Working env with terminal reward | No submittable environment |
| 3. Dense Reward | 5-7 | Trainable environment | Terminal-only reward may not train |
| 4. Training | 8-12 | Trained model + comparison | No "Technical Excellence" or "Demo" |
| 5. Polish | 13-16 | Blog + HF Space + GitHub | No submission |

**Minimum viable submission**: Phases 1-2-4-5 (skip dense reward, train with terminal-only). This is risky, since terminal-only reward is sparse and may not produce a meaningful trained model, but it is submittable.

**Recommended path**: All 5 phases in order. Dense reward (Phase 3) is what makes training work and what demonstrates "Technical Excellence."

---

## 6. Post-Submission Improvements (Backlog)

These are improvements to pursue **only after** a working submission exists. Ordered by expected impact.

### 6.1 Multi-Database Verification

**Impact**: High. Defends against "accidental correctness": queries that return the right answer for the wrong reasons on one dataset.

**What**: For each question, generate 1-2 variant SQLite databases with the same schema but different data distributions. An answer must be correct across all variants.

**Implementation**: A script that takes a base DB + gold SQL, runs targeted mutations (see 6.2), re-runs the gold SQL to get the new expected answer, and packages the result as a variant DB.

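A skeleton of that script (names and the mutation callable are ours; each mutation would be one of the strategies in Section 6.2):

```python
import shutil
import sqlite3
from typing import Callable

def make_variant(base_db: str, variant_db: str, gold_sql: str,
                 mutate: Callable[[sqlite3.Connection], None]) -> list[tuple]:
    """Copy the base DB, apply one mutation, return the new gold rows."""
    shutil.copyfile(base_db, variant_db)
    conn = sqlite3.connect(variant_db)
    try:
        mutate(conn)  # e.g., inject irrelevant rows (Section 6.2, test #2)
        conn.commit()
        return conn.execute(gold_sql).fetchall()  # re-derived expected answer
    finally:
        conn.close()
```
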
### 6.2 Metamorphic Testing Suite

Ten database mutations that catch common SQL errors without requiring "SQL correctness" checking:

| # | Test | What It Catches | Best For |
|---|---|---|---|
| 1 | Row-order permutation | Missing ORDER BY, positional dependencies | All types |
| 2 | Irrelevant row injection | Missing filters, wrong date logic | Filtered aggregates |
| 3 | Dangling entity injection | Incorrect outer joins, wrong join direction | Aggregates over fact tables |
| 4 | Key re-encoding (ID remap) | Hard-coded IDs, magnitude assumptions | All types |
| 5 | Duplicate bridge rows | Missing DISTINCT, cartesian joins | Unique counts |
| 6 | NULL perturbation | COUNT(col) vs COUNT(*), NULL comparison bugs | Counts, joins |
| 7 | Unit scaling (numeric × factor) | Wrong aggregation, measure/count confusion | Numeric aggregates |
| 8 | Noise facts (orphan FKs) | Missing joins, fact-only queries | Join-dependent queries |
| 9 | Tie injection at k-th boundary | Brittle top-k, missing tiebreak | Top-k/ranking |
| 10 | Label swap (category permutation) | Surface-string heuristics, shortcut patterns | Category-based queries |

**MVP subset**: Tests 2, 4, 5 (highest signal, cheapest to implement).

### 6.3 Two-Tier Action Space (RA/CTE Mode)

**Impact**: Medium. Makes intermediate rewards easier (subset/superset signals on intermediate tables) and improves dialect portability.

**What**: Add structured relational algebra operations (filter, join, group, union) as an alternative to raw SQL. The agent can build intermediate tables step by step.

**Why it helps**: QueryGym shows this makes the environment more RL-friendly: less syntax failure, easier partial credit, engine-agnostic. But it's significant implementation work.

**When to add**: Only if raw SQL + dense reward proves insufficient for training.

### 6.4 Difficulty Curriculum

**What**: Organize questions into Easy/Medium/Hard tiers. Start training on Easy, progress to Medium/Hard as the agent improves.

| Level | Schema | Query Type | Example |
|---|---|---|---|
| Easy | 1-2 tables, <10 cols | SELECT, WHERE, COUNT | "How many orders in January?" |
| Medium | 3-5 tables, JOINs | JOIN, GROUP BY, HAVING | "Top 3 customers by total spend" |
| Hard | 5+ tables, subqueries | Nested queries, CTEs | "Customers in every product category" |

**When to add**: After basic training works. Curriculum learning can significantly improve convergence on harder questions.

### 6.5 Additional Question Sources

| Source | Benefit | Effort |
|---|---|---|
| BIRD | Richer real-world databases, harder questions | Medium (different format, needs adaptation) |
| WikiTableQuestions | Simple single-table questions, good for the easy tier | Low |
| Custom-generated | Control over the difficulty distribution | High (need to write questions + verify gold) |

### 6.6 Structural SQL Similarity Signals

**Impact**: Low-Medium. An additional reward signal based on structural similarity between the agent's SQL and the gold SQL.

**What**: Compare table references, join graph overlap, and aggregate functions used. NOT lexical similarity (bigrams overfit to syntax style).

**Caution**: Keep the weight very low. SQL-TRAIL uses this but acknowledges it can reward "copying style" over semantic correctness.

### 6.7 Observation Enhancements

- Column statistics (min/max/distinct count) as part of the DESCRIBE response
- Query execution plan as optional feedback
- "Hint" mode for progressive difficulty (reveal join paths after N failed queries)

---

## 7. Risk Register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| **Terminal-only reward doesn't train** | High | High | Phase 3 (dense reward) is the primary mitigation. If needed, fall back to an even simpler environment (fewer tables, easier questions) |
| **Dense reward leaks gold answer** | Medium | Medium | Coarsen progress to 5 bins, cap step rewards at 0.5, keep step signals small |
| **Agent learns to exploit reward shaping** | Medium | Medium | Anti-gaming measures (Section 3.6); monitor training for degenerate behavior |
| **Spider questions too easy / too hard for RL** | Medium | Medium | Curate the question subset carefully; start with questions where random exploration has some chance of partial progress |
| **Training doesn't converge in time** | Medium | High | Start with a very small model (1.7B), short episodes, easy questions. A small improvement over random is sufficient for the demo |
| **WebSocket timeout during training** | Low | Medium | Increase `--ws-ping-interval` and `--ws-ping-timeout` to 300s in the Dockerfile (documented in OpenEnv troubleshooting) |
| **SQLite limitations (limited type system, dialect quirks)** | Low | Low | Spider questions are designed for SQLite. Avoid questions requiring features SQLite lacks |
| **Blog doesn't have compelling results** | Medium | High | Even negative results are interesting ("here's what we tried, here's what worked"). Focus the story on the environment design, not just training outcomes |

---

## Appendix A: Key Differences from SQLEnv_Concept.md (v0)

| Topic | v0 (SQLEnv_Concept.md) | v1 (This Document) |
|---|---|---|
| **Reward structure** | Static weights: 70% correctness, 20% efficiency, 10% quality | 3-layer architecture: operational validity + progress-to-target + terminal correctness, mapped to TRL reward_funcs |
| **Exploration penalty** | Quadratic penalty after 2 "free" queries | Constant step cost + repeat detection. No quadratic penalty (it punishes exploration) |
| **Query quality scoring** | "Appropriate JOINs" (subjective) | Removed. Replaced with objective signals (executed successfully, improved progress, no repeats) |
| **Reward implementation** | Unclear how to integrate with training | Explicit mapping to the TRL GRPOTrainer pattern (reward_funcs + rollout_func) |
| **Multi-DB verification** | Listed as a core feature | Moved to the post-submission backlog. MVP uses a single DB |
| **Scale database (10x)** | Listed as one of 3 DB variants | Dropped. Performance testing is irrelevant for correctness verification |
| **Deliverables** | Checklist without order | Phased MVP track with explicit "done when" gates |
| **Development timeline** | "2-3 weeks" (unstructured) | 5 phases with day estimates and dependencies |

## Appendix B: Research References

| Paper / System | Key Idea Adopted |
|---|---|
| **SQL-TRAIL** (2026) | Multi-term reward panel: execution correctness + behavioral signals. Correctness must dominate |
| **PaVeRL-SQL** (2025) | Fractional execution accuracy (partial match). Avoid strong negative rewards |
| **QueryGym** (2025) | Subset/superset intermediate rewards. Two-tier action space (RA + SQL). POMDP framing |
| **Graph-Reward-SQL** (EMNLP 2025) | Stepwise CTE evaluation. Intermediate structure supervision |
| **OpenEnv Wordle GRPO** | TRL integration pattern: rollout_func + reward_funcs + GRPOTrainer |
| **Spider Test Suite** | Multi-database verification for semantic equivalence |

## Appendix C: Green Agent Wrapper (Sketch)

```python
class SQLGreenAgent:
    """Automated evaluation agent for SQLEnv.

    Runs N episodes with a given policy (random, heuristic, or trained model),
    reports success rate, avg reward, avg steps.
    """

    def __init__(self, env_client: SQLEnv, policy):
        self.env = env_client
        self.policy = policy

    def evaluate(self, n_episodes: int = 100) -> dict:
        results = []
        for _ in range(n_episodes):
            result = self.env.reset()
            obs = result.observation
            total_reward = 0.0

            while not result.done:
                action = self.policy.select_action(obs)
                result = self.env.step(action)
                obs = result.observation
                total_reward += result.reward or 0

            results.append({
                # Step rewards are capped at 0.5, so > 0.5 implies the +1.0
                # terminal correctness reward was earned (Section 3.4)
                "correct": total_reward > 0.5,
                "total_reward": total_reward,
                "steps": obs.step_count,
            })

        return {
            "success_rate": sum(r["correct"] for r in results) / len(results),
            "avg_reward": sum(r["total_reward"] for r in results) / len(results),
            "avg_steps": sum(r["steps"] for r in results) / len(results),
        }
```

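Hypothetical usage (the client constructor argument and `RandomPolicy` are placeholders for whatever Phase 4 actually provides, not established API):

```python
# Hypothetical: connect to a locally running SQLEnv server and evaluate
# a random baseline. Constructor arguments depend on the EnvClient API.
env = SQLEnv(base_url="http://localhost:8000")
report = SQLGreenAgent(env, RandomPolicy()).evaluate(n_episodes=50)
print(report)  # {'success_rate': ..., 'avg_reward': ..., 'avg_steps': ...}
```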