# SQLEnv: Interactive Database Query Environment (v1 Spec)
## OpenEnv Challenge Submission Design & MVP Development Plan
---
## 1. What We're Building & Delivering
### 1.1 One-Sentence Summary
SQLEnv is an RL environment where an agent answers natural language questions about databases through iterative SQL exploration (schema inspection, exploratory queries, result observation, and refinement) before submitting a final answer.
### 1.2 Why It's Novel
To our knowledge, no RL environment for interactive SQL exists. Text-to-SQL benchmarks (Spider, BIRD, WikiSQL) are static single-shot evaluations. SQLEnv transforms this into a multi-turn exploration problem where agents develop query strategies through trial, error, and feedback. This maps directly to how real data analysts work.
### 1.3 Submission Artifacts (Mapped to Evaluation Criteria)
| Evaluation Criterion | Artifact | Description |
|---|---|---|
| **Creative and Robust use of OpenEnv** | SQLEnv environment on HF Hub | Full environment: models, server, client, Dockerfile, `openenv.yaml` |
| **Technical Excellence** | 3-layer reward architecture + multi-type answer verification | Dense stepwise reward (not just terminal), typed answer comparison |
| **Story-telling** | HuggingFace blog post | "Teaching AI to think like a data analyst": untrained vs. trained agent |
| **Open Source Demo** | Training notebooks/scripts on GitHub | GRPO training script (TRL-compatible), baseline comparison |
| **Green Agent wrapper** | Green Agent class | Automated evaluation wrapper following OpenEnv pattern |
### 1.4 What Exactly Gets Submitted
1. **HF Hub Space**: Docker container running SQLEnv server (WebSocket API)
2. **GitHub repo**: Environment source + GRPO training notebook + results
3. **HF Blog post**: Narrative + learning curves + side-by-side demo (untrained vs. trained)
---
## 2. Environment Design
### 2.1 OpenEnv Integration Architecture
```
envs/sql_env/
├── __init__.py
├── models.py            # SQLAction, SQLObservation (Pydantic)
├── client.py            # SQLEnv(EnvClient) - WebSocket client
├── openenv.yaml         # Environment manifest
├── pyproject.toml
├── uv.lock
├── data/                # SQLite databases + question sets
│   ├── databases/       # Spider DB files
│   └── questions/       # JSON question sets with gold answers
└── server/
    ├── __init__.py
    ├── app.py           # create_app(SQLEnvironment, SQLAction, SQLObservation)
    ├── environment.py   # SQLEnvironment(Environment) - core logic
    ├── reward.py        # Reward computation (3 layers)
    ├── verifier.py      # Answer comparison (multi-type)
    ├── requirements.txt
    └── Dockerfile
```
### 2.2 Pydantic Models
```python
# models.py
from pydantic import Field

from openenv.core.env_server.types import Action, Observation, State


class SQLAction(Action):
    """What the agent sends each step."""

    action_type: str = Field(
        ...,
        description="One of: DESCRIBE, SAMPLE, QUERY, ANSWER",
    )
    argument: str = Field(
        ...,
        description="Table name (for DESCRIBE/SAMPLE), SQL string (for QUERY), or answer value (for ANSWER)",
    )


class SQLObservation(Observation):
    """What the agent receives after each step."""

    # Inherited: done (bool), reward (float | None)
    question: str = Field(..., description="The NL question to answer")
    schema_info: str = Field(..., description="Database schema description")
    result: str = Field(default="", description="Result of the last action (truncated)")
    error: str = Field(default="", description="Error message if action failed")
    step_count: int = Field(default=0, description="Current step number")
    budget_remaining: int = Field(default=0, description="Steps left before timeout")
    action_history: list[str] = Field(
        default_factory=list,
        description="Summary of previous actions taken",
    )
```
**Design note**: `result` is a string, not raw data. Results are always truncated/summarized (max N rows as formatted text). This is intentional: the agent sees "what a real analyst would see", not the full database. This makes the environment a POMDP, which is appropriate for the task and beneficial for learning dynamics (see Section 3).
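For concreteness, a minimal sketch of such a formatter (`format_result` and its signature are illustrative, not part of the spec's API):
```python
def format_result(rows: list[tuple], max_rows: int = 20) -> str:
    """Render query rows as truncated text, the way an analyst's console would."""
    shown = [" | ".join(str(v) for v in row) for row in rows[:max_rows]]
    if len(rows) > max_rows:
        shown.append(f"... ({len(rows) - max_rows} more rows truncated)")
    return "\n".join(shown)
```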
### 2.3 State
Uses the core `State` class from OpenEnv (`episode_id` + `step_count`). No custom state needed for MVP.
### 2.4 Action Space
| Action | Argument | Effect | Cost |
|---|---|---|---|
| `DESCRIBE` | table_name | Returns column names, types, row count | 1 step |
| `SAMPLE` | table_name | Returns 5 random rows (formatted text) | 1 step |
| `QUERY` | sql_string | Executes SQL, returns truncated results (max 20 rows) | 1 step |
| `ANSWER` | value | Submits final answer, ends episode | 0 steps (terminal) |
**Step budget**: 15 steps per episode (configurable). This is enough for 2-3 exploration actions + 3-5 query attempts + answer. Keeps episodes short enough for efficient training.
**Query sandboxing**: All SQL runs in a read-only SQLite connection with a statement timeout (5 seconds). Only SELECT statements allowed. No writes, no DDL, no pragmas.
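A minimal sketch of this sandbox using Python's stdlib `sqlite3` (the helper names `open_readonly` and `run_query` are ours; the progress-handler trick is one way to approximate a statement timeout):
```python
import sqlite3
import time

STATEMENT_TIMEOUT_S = 5.0

def open_readonly(db_path: str) -> sqlite3.Connection:
    # mode=ro makes SQLite itself reject writes and DDL, not just our validation
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

def run_query(conn: sqlite3.Connection, sql: str, max_rows: int = 20) -> list[tuple]:
    if not sql.strip().lower().startswith(("select", "with")):
        raise ValueError("Only SELECT statements are allowed")
    # The progress handler fires every N SQLite VM ops; a truthy return value
    # aborts the running statement with an OperationalError.
    deadline = time.monotonic() + STATEMENT_TIMEOUT_S
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 10_000)
    try:
        return conn.execute(sql).fetchmany(max_rows)
    finally:
        conn.set_progress_handler(None, 0)
```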
### 2.5 Episode Lifecycle
```
┌──────────────────────────────────────────────────────────────┐
│ reset()                                                      │
│   → Pick random question from question set                   │
│   → Load corresponding SQLite database (read-only)           │
│   → Return initial observation:                              │
│       question, schema_info (table names only), budget=15    │
└──────────────────────────────┬───────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ step(action) loop                                            │
│                                                              │
│   DESCRIBE → return columns/types for requested table        │
│   SAMPLE   → return 5 random rows from requested table       │
│   QUERY    → execute SQL, return truncated result or error   │
│   ANSWER   → compare to gold, compute terminal reward, done=T│
│                                                              │
│   Each non-ANSWER step: compute stepwise reward, decrement   │
│   budget. If budget=0 and no ANSWER: done=True, reward=0     │
└──────────────────────────────────────────────────────────────┘
```
**Important**: On `reset()`, the agent sees only **table names** in `schema_info`, not full column details. The agent must actively DESCRIBE tables to learn the schema. This is a deliberate design choice: it creates an exploration incentive and mirrors real-world "I have a database, what's in it?" workflows.
---
## 3. Reward Architecture
The reward research (SQL-TRAIL, PaVeRL-SQL, QueryGym, Graph-Reward-SQL) converges on one principle: **reward improvement during exploration, but make correctness dominate at termination.**
Our reward has 3 layers. The total episode reward is:
```
R_episode = R_terminal + sum(R_step_i), with sum(R_step_i) clamped to [-0.2, 0.5]
```
### 3.1 Layer 1 β€” Operational Validity (every step, no gold reference needed)
These signals reward "being able to operate the tool" without referencing the gold answer.
| Signal | Value | When |
|---|---|---|
| `r_exec_ok` | +0.02 | A QUERY executes without error |
| `r_new_info` | +0.01 | Action reveals new info (new table described, new column seen). Capped at 0.10 per episode |
| `r_repeat` | -0.01 | Exact same SQL run again (hash match), or exact same DESCRIBE/SAMPLE repeated |
| `r_cost` | -0.005 | Every step (small constant, keeps trajectories short) |
**Design rationale**: Prefer small positive signals + "no reward" over heavy negative penalties. Strong negatives make agents risk-averse and destabilize training (PaVeRL-SQL). The step cost is tiny, just enough to prefer shorter trajectories when everything else is equal.
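For concreteness, a sketch of how these four signals might be computed per step (`EpisodeTracker` and `layer1_reward` are illustrative names, not the spec's API):
```python
from dataclasses import dataclass, field

@dataclass
class EpisodeTracker:
    seen_actions: set[str] = field(default_factory=set)  # hashes of prior actions
    new_info_total: float = 0.0                          # r_new_info paid so far

def layer1_reward(t: EpisodeTracker, action_key: str,
                  exec_ok: bool, new_info: bool) -> float:
    r = -0.005                                # r_cost: constant per-step cost
    if exec_ok:
        r += 0.02                             # r_exec_ok: QUERY ran without error
    if action_key in t.seen_actions:
        r -= 0.01                             # r_repeat: exact repeat (hash match)
    t.seen_actions.add(action_key)
    if new_info and t.new_info_total < 0.10:
        r += 0.01                             # r_new_info, capped at 0.10/episode
        t.new_info_total += 0.01
    return r
```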
### 3.2 Layer 2 β€” Progress-to-Target (QUERY steps only, oracle-based but coarsened)
After each QUERY, we compute how close the query result is to the gold answer, then reward only *improvement* over the best-so-far.
**Progress metric** (depends on answer type):
| Answer Type | Progress Computation |
|---|---|
| **Integer/Count** | `1 - min(1, abs(pred - gold) / max(1, abs(gold)))` |
| **Float/Average** | Same as integer but with tolerance (within 1% = 1.0) |
| **String/Name** | Exact match = 1.0, else 0.0 (too fragile to partially score) |
| **List/Set** | `Jaccard(pred_set, gold_set)` (set overlap) |
| **Table** | `0.5 * column_overlap + 0.5 * row_sample_overlap` |
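A sketch of the first four rows (the table-overlap case is omitted; `compute_progress` is our name, and its return value is the `progress_raw` fed into the binning below):
```python
def compute_progress(pred, gold, answer_type: str) -> float:
    """Type-dependent closeness to the gold answer, in [0, 1]."""
    if answer_type == "integer":
        return 1 - min(1, abs(pred - gold) / max(1, abs(gold)))
    if answer_type == "float":
        rel = abs(pred - gold) / max(1, abs(gold))
        return 1.0 if rel < 0.01 else 1 - min(1, rel)
    if answer_type == "string":
        return 1.0 if str(pred).strip().lower() == str(gold).strip().lower() else 0.0
    if answer_type == "list":
        p, g = set(pred), set(gold)
        return len(p & g) / len(p | g) if (p | g) else 1.0  # Jaccard overlap
    raise ValueError(f"Unknown answer_type: {answer_type}")
```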
**Coarsening**: To prevent reward leakage (agent hill-climbing on the reward signal instead of reasoning), we bin progress into 5 levels: {0, 0.25, 0.5, 0.75, 1.0}.
**Improvement-only reward**:
```python
def bin_to_nearest(bins, x):
    # snap raw progress to the nearest of the 5 levels
    return min(bins, key=lambda b: abs(b - x))

progress_binned = bin_to_nearest([0, 0.25, 0.5, 0.75, 1.0], progress_raw)
r_progress = max(0, progress_binned - best_progress_so_far) * 0.15
best_progress_so_far = max(best_progress_so_far, progress_binned)
```
This prevents agents from "farming" reward by oscillating between good and bad queries.
### 3.3 Layer 3 β€” Terminal Correctness (ANSWER action only)
| Condition | Reward |
|---|---|
| Answer matches gold (type-appropriate comparison) | +1.0 |
| Answer does not match | 0.0 |
| Episode times out (budget exhausted, no ANSWER) | 0.0 |
Terminal correctness is always the dominant signal. With the cap on stepwise rewards (0.5 max), a correct answer is always worth at least 2x the maximum exploration reward.
### 3.4 Total Reward Formula
```
R_episode = 1.0 * R_correct + clamp(sum(r_exec_ok + r_new_info + r_repeat + r_cost + r_progress), -0.2, 0.5)
```
The negative clamp at -0.2 prevents catastrophic negative episodes from destabilizing training.
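Expressed as code (a sketch; `episode_reward` is our name):
```python
def episode_reward(r_correct: float, step_rewards: list[float]) -> float:
    # Shaped rewards are clamped to [-0.2, 0.5], so a correct answer (+1.0)
    # always dominates anything exploration alone can earn.
    shaped = max(-0.2, min(0.5, sum(step_rewards)))
    return 1.0 * r_correct + shaped
```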
### 3.5 How This Maps to TRL's GRPOTrainer
Following the Wordle GRPO tutorial pattern, we expose reward components as separate `reward_funcs`:
```python
# In the training script
trainer = GRPOTrainer(
    model=model_name,
    processing_class=tokenizer,
    reward_funcs=[
        reward_correctness,   # Terminal: 0.0 or 1.0
        reward_progress,      # Cumulative progress improvement (0 to 0.15)
        reward_operational,   # Sum of exec_ok, new_info, repeat, cost signals
    ],
    train_dataset=dataset,
    args=grpo_config,
    rollout_func=rollout_func,
)
```
The `rollout_func` runs a full episode against the SQLEnv server (via WebSocket), collects the observation stream, and computes per-component rewards. This lets TRL handle the weighting/normalization.
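A sketch of the episode loop at the heart of `rollout_func` (the exact TRL plumbing follows the Wordle tutorial; `play_episode` and `select_action` are illustrative names):
```python
def play_episode(env: SQLEnv, select_action) -> dict:
    """Play one episode; select_action maps an SQLObservation to an SQLAction."""
    result = env.reset()
    obs, shaped, terminal = result.observation, 0.0, 0.0
    while not result.done:
        result = env.step(select_action(obs))
        obs = result.observation
        if result.done:
            terminal = result.reward or 0.0   # Layer 3 (terminal correctness)
        else:
            shaped += result.reward or 0.0    # Layers 1 + 2 (stepwise signals)
    return {"terminal": terminal, "shaped": shaped, "steps": obs.step_count}
```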
### 3.6 Anti-Gaming Measures
| Risk | Mitigation |
|---|---|
| Agent hill-climbs on progress signal to infer gold answer | Coarsen progress to 5 bins; step rewards are small vs. terminal |
| Agent DESCRIBEs everything to harvest `r_new_info` | Cap `r_new_info` at 0.10 per episode |
| Agent submits random answers hoping for partial credit | No partial credit on ANSWER: binary correctness only |
| Agent runs many queries to accumulate `r_exec_ok` | Step cost, the 0.5 cap on summed step rewards, and the 15-step budget bound the gain; burning the budget forfeits the dominant terminal reward |
| Agent repeats identical queries | Hash-based repeat detection, penalty applied |
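The hash-based repeat detection can be as simple as the following; normalizing whitespace and case before hashing is our choice (the table above only requires exact-match hashing):
```python
import hashlib
import re

def sql_fingerprint(sql: str) -> str:
    # Collapse whitespace and lowercase so trivially reformatted SQL still
    # counts as a repeat of the same query.
    canonical = re.sub(r"\s+", " ", sql.strip()).lower()
    return hashlib.sha256(canonical.encode()).hexdigest()
```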
---
## 4. Question Sourcing & Verification
### 4.1 MVP: Spider Subset
**Why Spider**: Most-studied text-to-SQL benchmark, gold SQL available for all questions, existing tooling for test suite generation, clean SQLite databases included, well-understood difficulty levels.
**MVP question set**: 50-100 questions from Spider's dev set, selected to cover:
- Simple (SELECT/WHERE/COUNT): ~40%
- Medium (JOIN/GROUP BY): ~40%
- Hard (subqueries/HAVING/multi-step): ~20%
**Answer types to support in MVP**:
- Integer (counts, sums)
- Float (averages, with 1% tolerance)
- String (single value lookups)
- List (top-k results, order-insensitive set comparison)
### 4.2 Answer Verification
```python
def verify_answer(predicted, gold, answer_type: str) -> bool:
    match answer_type:
        case "integer":
            return int(predicted) == int(gold)
        case "float":
            return abs(float(predicted) - float(gold)) / max(1, abs(float(gold))) < 0.01
        case "string":
            return str(predicted).strip().lower() == str(gold).strip().lower()
        case "list":
            return set(normalize(predicted)) == set(normalize(gold))  # normalize: verifier.py helper
        case "table":
            return compare_tables(predicted, gold)  # Column + row overlap (verifier.py helper)
        case _:
            raise ValueError(f"Unknown answer_type: {answer_type}")
```
### 4.3 Question Metadata Format
Each question in the JSON dataset includes:
```json
{
"id": "spider_dev_042",
"question": "How many employees are in the Engineering department?",
"database": "company_db",
"gold_sql": "SELECT COUNT(*) FROM employees WHERE department = 'Engineering'",
"gold_answer": "42",
"answer_type": "integer",
"difficulty": "easy",
"tables_involved": ["employees"]
}
```
The `gold_sql` is used for progress computation (running the gold query on the DB to get the reference answer). The `gold_answer` is the cached expected result. Both are hidden from the agent.
### 4.4 Multi-DB Verification (Post-MVP)
**MVP**: Single database per question. Accept correct result on the original DB.
**Post-MVP**: For each question, generate 1-2 variant databases with the same schema but different data. An answer must be correct across all variants.
Variant generation strategies (prioritized):
1. **Irrelevant row injection**: add records outside the question's filter scope
2. **Join multiplicity trap**: add duplicates in bridge tables
3. **ID remap**: apply a bijection to primary keys, update foreign keys
(See Section 6.2 for the full metamorphic testing backlog.)
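A sketch of the post-MVP check, reusing helpers sketched earlier (`verified_on_variants` is our name; result-shape handling is simplified):
```python
def verified_on_variants(answer, meta: dict, variant_db_paths: list[str]) -> bool:
    """The answer must be correct on every variant database."""
    scalar_types = ("integer", "float", "string")
    for db_path in variant_db_paths:
        conn = open_readonly(db_path)  # read-only helper from the Section 2.4 sketch
        rows = conn.execute(meta["gold_sql"]).fetchall()
        gold = rows[0][0] if meta["answer_type"] in scalar_types else rows
        if not verify_answer(answer, gold, meta["answer_type"]):
            return False
    return True
```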
---
## 5. MVP Development Track
The goal is to get a **working, submittable solution as fast as possible**, then improve iteratively. Each phase has a "done when" gate. Do not start the next phase until the current one passes its gate.
### Phase 1: Scaffold & Stub (Day 1)
**What**: Run `openenv init sql_env`, customize the generated models/server/client stubs to match our Pydantic models. Get a Docker container that starts and responds to reset/step.
**Tasks**:
1. `openenv init sql_env`
2. Replace generated models with `SQLAction`, `SQLObservation` from Section 2.2
3. Implement stub `SQLEnvironment.reset()` → returns a hardcoded observation
4. Implement stub `SQLEnvironment.step()` → accepts an action, returns a hardcoded observation
5. Implement `SQLEnv(EnvClient)` client with `_step_payload`, `_parse_result`, `_parse_state`
6. `openenv build` and `openenv validate`
**Done when**: `openenv validate --verbose` passes. Client can connect, reset, step, and receive typed observations.
### Phase 2: Core Loop with Terminal Reward (Days 2-4)
**What**: Wire up real SQLite databases, implement the action handlers, add terminal-only reward (binary correctness on ANSWER). This is a **submittable environment**: sparse reward, but functional.
**Tasks**:
1. Download Spider dev databases (SQLite files) and a curated question set (30-50 questions)
2. Implement `reset()`: pick random question, load DB, return initial observation with table names
3. Implement DESCRIBE handler: query `sqlite_master` + `PRAGMA table_info` (see the sketch after this list)
4. Implement SAMPLE handler: `SELECT * FROM {table} ORDER BY RANDOM() LIMIT 5`
5. Implement QUERY handler: execute read-only SQL with timeout, truncate results to 20 rows
6. Implement ANSWER handler: compare to gold answer using `verify_answer()`, return reward
7. Add budget tracking (15 steps), timeout handling
8. SQL sandboxing: read-only connection, statement timeout, no DDL
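A sketch of the DESCRIBE handler from task 3 (output format is our choice; the `sqlite_master` check doubles as injection protection for the interpolated table name):
```python
import sqlite3

def describe(conn: sqlite3.Connection, table: str) -> str:
    # Validate the table name against sqlite_master before interpolating it
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        return f"ERROR: no such table: {table}"
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    (n_rows,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    col_desc = ", ".join(f"{name} {ctype}" for _, name, ctype, *_ in cols)
    return f"{table} ({n_rows} rows): {col_desc}"
```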
**Done when**: Can run a full episode manually (reset, DESCRIBE a table, run a query, submit an answer, get a correct/incorrect reward). Works in Docker via WebSocket.
### Phase 3: Dense Reward (Days 5-7)
**What**: Add reward Layers 1 and 2 (operational validity + progress-to-target). This makes the environment trainable: agents get feedback before the terminal step.
**Tasks**:
1. Implement `reward.py` with the 3-layer reward computation
2. Layer 1: Track executed queries (hash set for repeat detection), track schema exploration (set of described tables/columns), compute `r_exec_ok`, `r_new_info`, `r_repeat`, `r_cost`
3. Layer 2: After each QUERY, run gold SQL on the DB, compare query result to gold result using the progress metric (type-dependent), bin to 5 levels, compute improvement-only reward
4. Wire reward computation into `step()`; return the stepwise reward in the observation
5. Add reward capping logic (sum of step rewards <= 0.5, negative floor at -0.2)
6. Test: run episodes, verify reward signals are sensible
**Done when**: Reward signal varies meaningfully across different agent behaviors (random exploration gives some small positive reward; targeted queries give progress reward; correct answer gives terminal reward).
### Phase 4: Training Pipeline (Days 8-12)
**What**: Implement the GRPO training script following the Wordle tutorial pattern. Train a small model. Produce baseline vs. trained comparison.
**Tasks**:
1. Write `rollout_func` that plays full SQLEnv episodes via WebSocket client
2. Design system prompt for the SQL agent (schema understanding, query strategy, answer formatting)
3. Implement reward functions for TRL (`reward_correctness`, `reward_progress`, `reward_operational`)
4. Set up training config (Qwen3-1.7B or similar small model, GRPO via TRL+vLLM)
5. Run training (start small: 100 episodes, observe learning curves)
6. Implement Green Agent wrapper (automated evaluation: run N episodes, report success rate)
7. Produce comparison: random policy vs. trained model (success rate, avg steps, avg reward)
8. Debug and iterate on reward weights if training doesn't converge
**Done when**: Trained model measurably outperforms random baseline on success rate. Training notebook runs end-to-end. Green Agent reports evaluation metrics.
### Phase 5: Polish & Submit (Days 13-16)
**What**: Deploy to HF, write blog, prepare demo.
**Tasks**:
1. `openenv push` to HuggingFace Spaces
2. Clean up GitHub repo: README, requirements, training notebook, results
3. Write HF blog post:
- Hook: "What if AI could learn to query databases like a data analyst?"
- Problem: Static benchmarks don't teach exploration strategy
- Solution: SQLEnv, an interactive environment with dense reward
- Results: Learning curves, before/after comparison, example episodes
- Technical: Reward architecture, OpenEnv integration
4. Record/screenshot side-by-side demo (untrained vs. trained agent)
5. Final validation: someone else can `pip install`, connect to HF Space, run training notebook
**Done when**: All 3 submission artifacts are live (HF Space, GitHub repo, HF blog). Blog tells a compelling story with real results.
### Phase Summary
| Phase | Days | Produces | Risk if Skipped |
|---|---|---|---|
| 1. Scaffold | 1 | Stub environment in Docker | Can't build anything |
| 2. Core Loop | 2-4 | Working env with terminal reward | No submittable environment |
| 3. Dense Reward | 5-7 | Trainable environment | Terminal-only reward may not train |
| 4. Training | 8-12 | Trained model + comparison | No "Technical Excellence" or "Demo" |
| 5. Polish | 13-16 | Blog + HF Space + GitHub | No submission |
**Minimum viable submission**: Phases 1-2-4-5 (skip dense reward, train with terminal-only). This is risky, since terminal-only reward is sparse and may not produce a meaningful trained model, but it is submittable.
**Recommended path**: All 5 phases in order. Dense reward (Phase 3) is what makes training work and what demonstrates "Technical Excellence."
---
## 6. Post-Submission Improvements (Backlog)
These are improvements to pursue **only after** a working submission exists. Ordered by expected impact.
### 6.1 Multi-Database Verification
**Impact**: High. Defends against "accidental correctness", i.e. queries that return the right answer for the wrong reasons on one dataset.
**What**: For each question, generate 1-2 variant SQLite databases with the same schema but different data distributions. Answer must be correct across all variants.
**Implementation**: Script that takes a base DB + gold SQL, runs targeted mutations (see 6.2), re-runs gold SQL to get new expected answer, packages as variant DB.
### 6.2 Metamorphic Testing Suite
Ten database mutations that catch common SQL errors without requiring "SQL correctness" checking:
| # | Test | What It Catches | Best For |
|---|---|---|---|
| 1 | Row-order permutation | Missing ORDER BY, positional dependencies | All types |
| 2 | Irrelevant row injection | Missing filters, wrong date logic | Filtered aggregates |
| 3 | Dangling entity injection | Incorrect outer joins, wrong join direction | Aggregates over fact tables |
| 4 | Key re-encoding (ID remap) | Hard-coded IDs, magnitude assumptions | All types |
| 5 | Duplicate bridge rows | Missing DISTINCT, cartesian joins | Unique counts |
| 6 | NULL perturbation | COUNT(col) vs COUNT(*), NULL comparison bugs | Counts, joins |
| 7 | Unit scaling (numeric × factor) | Wrong aggregation, measure/count confusion | Numeric aggregates |
| 8 | Noise facts (orphan FKs) | Missing joins, fact-only queries | Join-dependent queries |
| 9 | Tie injection at k-th boundary | Brittle top-k, missing tiebreak | Top-k/ranking |
| 10 | Label swap (category permutation) | Surface-string heuristics, shortcut patterns | Category-based queries |
**MVP subset**: Tests 2, 4, 5 (highest signal, cheapest to implement).
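For a flavor of the implementation effort, a sketch of test 5 (duplicate bridge rows), run offline against a writable copy of the database during variant generation (the function name is ours):
```python
import sqlite3

def inject_duplicate_bridge_rows(conn: sqlite3.Connection,
                                 bridge_table: str, k: int = 3) -> None:
    # Duplicated bridge rows leave DISTINCT-based queries unchanged but
    # inflate counts for queries that forgot DISTINCT (test 5).
    rows = conn.execute(f"SELECT * FROM {bridge_table} LIMIT {k}").fetchall()
    if rows:
        placeholders = ", ".join("?" * len(rows[0]))
        conn.executemany(f"INSERT INTO {bridge_table} VALUES ({placeholders})", rows)
        conn.commit()
```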
### 6.3 Two-Tier Action Space (RA/CTE Mode)
**Impact**: Medium. Makes intermediate rewards easier (subset/superset signals on intermediate tables) and improves dialect portability.
**What**: Add structured relational algebra operations (filter, join, group, union) as an alternative to raw SQL. Agent can build intermediate tables step-by-step.
**Why it helps**: QueryGym shows this makes the environment more RL-friendly: less syntax failure, easier partial credit, engine-agnostic. But it is significant implementation work.
**When to add**: Only if raw SQL + dense reward proves insufficient for training.
### 6.4 Difficulty Curriculum
**What**: Organize questions into Easy/Medium/Hard tiers. Start training on Easy, progress to Medium/Hard as the agent improves.
| Level | Schema | Query Type | Example |
|---|---|---|---|
| Easy | 1-2 tables, <10 cols | SELECT, WHERE, COUNT | "How many orders in January?" |
| Medium | 3-5 tables, JOINs | JOIN, GROUP BY, HAVING | "Top 3 customers by total spend" |
| Hard | 5+ tables, subqueries | Nested queries, CTEs | "Customers in every product category" |
**When to add**: After basic training works. Curriculum learning can significantly improve convergence on harder questions.
### 6.5 Additional Question Sources
| Source | Benefit | Effort |
|---|---|---|
| BIRD | Richer real-world databases, harder questions | Medium (different format, needs adaptation) |
| WikiTableQuestions | Simple single-table questions, good for easy tier | Low |
| Custom-generated | Control over difficulty distribution | High (need to write questions + verify gold) |
### 6.6 Structural SQL Similarity Signals
**Impact**: Low-Medium. Additional reward signal based on structural similarity between agent's SQL and gold SQL.
**What**: Compare table references, join graph overlap, aggregate functions used. NOT lexical similarity (bigrams overfit to syntax style).
**Caution**: Keep weight very low. SQL-TRAIL uses this but acknowledges it can reward "copying style" over semantic correctness.
### 6.7 Observation Enhancements
- Column statistics (min/max/distinct count) as part of DESCRIBE response
- Query execution plan as optional feedback
- "Hint" mode for progressive difficulty (reveal join paths after N failed queries)
---
## 7. Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| **Terminal-only reward doesn't train** | High | High | Phase 3 (dense reward) is the primary mitigation. If needed, fall back to even simpler environment (fewer tables, easier questions) |
| **Dense reward leaks gold answer** | Medium | Medium | Coarsen progress to 5 bins, cap step rewards at 0.5, keep step signals small |
| **Agent learns to exploit reward shaping** | Medium | Medium | Anti-gaming measures (Section 3.6), monitor training for degenerate behavior |
| **Spider questions too easy / too hard for RL** | Medium | Medium | Curate question subset carefully; start with questions where random exploration has some chance of partial progress |
| **Training doesn't converge in time** | Medium | High | Start with a very small model (1.7B), short episodes, easy questions. A small improvement over random is sufficient for the demo |
| **WebSocket timeout during training** | Low | Medium | Increase `--ws-ping-interval` and `--ws-ping-timeout` to 300s in Dockerfile (documented in OpenEnv troubleshooting) |
| **SQLite limitations (no window functions, limited types)** | Low | Low | Spider questions are designed for SQLite. Avoid questions requiring features SQLite lacks |
| **Blog doesn't have compelling results** | Medium | High | Even negative results are interesting ("here's what we tried, here's what worked"). Focus story on the environment design, not just training outcomes |
---
## Appendix A: Key Differences from SQLEnv_Concept.md (v0)
| Topic | v0 (SQLEnv_Concept.md) | v1 (This Document) |
|---|---|---|
| **Reward structure** | Static weights: 70% correctness, 20% efficiency, 10% quality | 3-layer architecture: operational validity + progress-to-target + terminal correctness, mapped to TRL reward_funcs |
| **Exploration penalty** | Quadratic penalty after 2 "free" queries | Constant step cost + repeat detection. No quadratic penalty (it punishes exploration) |
| **Query quality scoring** | "Appropriate JOINs" (subjective) | Removed. Replaced with objective signals (executed successfully, improved progress, no repeats) |
| **Reward implementation** | Unclear how to integrate with training | Explicit mapping to TRL GRPOTrainer pattern (reward_funcs + rollout_func) |
| **Multi-DB verification** | Listed as core feature | Moved to post-submission backlog. MVP uses single DB |
| **Scale database (10x)** | Listed as one of 3 DB variants | Dropped. Performance testing is irrelevant for correctness verification |
| **Deliverables** | Checklist without order | Phased MVP track with explicit "done when" gates |
| **Development timeline** | "2-3 weeks" (unstructured) | 5 phases with day estimates and dependencies |
## Appendix B: Research References
| Paper / System | Key Idea Adopted |
|---|---|
| **SQL-TRAIL** (2026) | Multi-term reward panel: execution correctness + behavioral signals. Correctness must dominate |
| **PaVeRL-SQL** (2025) | Fractional execution accuracy (partial match). Avoid strong negative rewards |
| **QueryGym** (2025) | Subset/superset intermediate rewards. Two-tier action space (RA + SQL). POMDP framing |
| **Graph-Reward-SQL** (EMNLP 2025) | Stepwise CTE evaluation. Intermediate structure supervision |
| **OpenEnv Wordle GRPO** | TRL integration pattern: rollout_func + reward_funcs + GRPOTrainer |
| **Spider Test Suite** | Multi-database verification for semantic equivalence |
## Appendix C: Green Agent Wrapper (Sketch)
```python
class SQLGreenAgent:
    """Automated evaluation agent for SQLEnv.

    Runs N episodes with a given policy (random, heuristic, or trained model),
    reports success rate, avg reward, avg steps.
    """

    def __init__(self, env_client: SQLEnv, policy):
        self.env = env_client
        self.policy = policy

    def evaluate(self, n_episodes: int = 100) -> dict:
        results = []
        for _ in range(n_episodes):
            result = self.env.reset()
            obs = result.observation
            total_reward = 0.0
            while not result.done:
                action = self.policy.select_action(obs)
                result = self.env.step(action)
                obs = result.observation
                total_reward += result.reward or 0
            results.append({
                "correct": total_reward > 0.5,  # terminal reward dominates
                "total_reward": total_reward,
                "steps": obs.step_count,
            })
        return {
            "success_rate": sum(r["correct"] for r in results) / len(results),
            "avg_reward": sum(r["total_reward"] for r in results) / len(results),
            "avg_steps": sum(r["steps"] for r in results) / len(results),
        }
```
```