---
title: SQLEnv
emoji: 🤗
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: true
base_path: /web
---
# SQLEnv: Teaching Small Models to Explore Databases
SQLEnv is an RL environment for training small language models to answer questions about SQL databases through iterative exploration. Instead of producing one-shot SQL from a fully visible schema, the agent discovers the schema step by step using four tools: DESCRIBE, SAMPLE, QUERY, and ANSWER.
Built on OpenEnv and trained with TRL's GRPO implementation. A 0.6B parameter model trained in this environment goes from 0% to ~30% accuracy on a curated Spider subset, learning to explore schemas, recover from SQL errors, and format answers correctly.
Blog post | Live environment | Training notebook
## Quick Start

```bash
uv sync
uv run pytest tests/ -v
```
Run the environment locally:

```bash
uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sqlenv:latest -f server/Dockerfile .
docker run -p 8000:8000 sqlenv:latest
```
## How It Works
Each episode starts with a natural-language question and a list of table names. The schema (columns, types, relationships) is hidden. The agent uses four actions to explore:
| Action | Purpose |
|---|---|
| `DESCRIBE table` | Reveal column names, types, and row count |
| `SAMPLE table` | Preview representative rows |
| `QUERY sql` | Execute read-only SQL |
| `ANSWER value` | Submit a final answer (ends episode) |
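The agent emits these actions as plain text, so malformed output matters (this is what the "parse rate" in the results below measures). A minimal parser might look like the following sketch; the `SQLAction` mirror and the parsing rules here are illustrative assumptions, not the repo's actual implementation in `sql_environment.py`:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical mirror of the environment's SQLAction type.
@dataclass
class SQLAction:
    action_type: str  # DESCRIBE | SAMPLE | QUERY | ANSWER
    argument: str

VALID_ACTIONS = {"DESCRIBE", "SAMPLE", "QUERY", "ANSWER"}

def parse_action(text: str) -> Optional[SQLAction]:
    """Split the first word off as the action verb; the rest is the argument."""
    match = re.match(r"\s*(\w+)\s+(.+)", text, re.DOTALL)
    if not match:
        return None
    verb, arg = match.group(1).upper(), match.group(2).strip()
    if verb not in VALID_ACTIONS:
        return None  # unparseable output counts against the parse rate
    return SQLAction(action_type=verb, argument=arg)
```

Anything that fails to parse yields no tool call, which is why untrained small models spend most of their step budget producing invalid actions.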
The environment provides a dense reward at each step (operational feedback plus progress toward the answer) and a terminal reward for correctness (+1.0 if correct, 0.0 otherwise). See the blog post for details on the reward architecture.
```python
from server.sql_environment import SQLEnvironment, SQLAction

env = SQLEnvironment(
    questions_path="data/questions/questions_train.json",
    db_dir="data/databases",
    tokenizer=tok,
)

obs = env.reset(seed=42)
obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
obs = env.step(SQLAction(action_type="ANSWER", argument="10"))
# obs.done == True, obs.reward == 1.0
```
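To make the layered-reward idea concrete, here is a sketch of how such a function might compose: the weights, the progress signal, and the layer boundaries below are made-up assumptions for illustration, not the values in `reward.py`:

```python
def step_reward(parsed_ok: bool, sql_error: bool, progress: float,
                done: bool, correct: bool) -> float:
    """Illustrative layered reward: operational feedback, progress shaping,
    and a terminal correctness bonus. All weights are assumed, not the repo's."""
    r = 0.0
    r += 0.05 if parsed_ok else -0.05   # layer 1: was the action well-formed?
    if sql_error:
        r -= 0.05                       # penalize SQL that fails to execute
    r += 0.1 * progress                 # layer 2: shaped progress toward the answer
    if done:
        r += 1.0 if correct else 0.0    # layer 3: terminal correctness
    return r
```

The key property is that the dense layers are small relative to the terminal reward, so shaping guides exploration without overwhelming the correctness signal.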
## Training

We train Qwen3-0.6B using GRPO (from DeepSeekMath) through TRL's `environment_factory`. The full pipeline (SFT warmup + two-phase GRPO) runs in ~5 hours on a single Colab L4.
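The core of GRPO is a group-relative advantage: several completions are sampled per prompt, and each completion's reward is normalized against its siblings rather than against a learned value function. A minimal standalone sketch of that normalization (TRL computes this internally; this is not its actual code):

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO-style advantage: z-score each reward within its sampled group,
    pushing the policy toward completions that beat their siblings."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With sparse 0/1 correctness rewards this only produces a learning signal once at least one sampled completion in a group succeeds, which is why the SFT warmup matters for a 0.6B model starting at 0% accuracy.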
Notebooks:
- `train_grpo.ipynb` runs the full SFT + GRPO pipeline
- `compare_methods.ipynb` evaluates base vs trained models
- `showcase_sqlenv.ipynb` lets you explore the environment interactively
Local test (CPU, ~3 min):

```bash
docker build -f Dockerfile.test -t sqlenv-test .
docker run --rm sqlenv-test
```
## Evaluation

All evaluation runs through the Green Agent evaluator:

```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

result = evaluate(env, policy, n_episodes=50, seed=0)
print(f"Accuracy: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
```
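Plugging a custom policy into `evaluate` presumably just means implementing an act-on-observation interface alongside `RandomPolicy` and `OraclePolicy`; the class below is a purely hypothetical sketch (the actual `Policy` protocol in `sql_env.evaluation` may differ):

```python
class FixedAnswerPolicy:
    """Toy baseline: describe one table, then always answer "0".
    Illustrative only; assumes a policy is a stateful act(observation) -> str."""
    def __init__(self, table: str = "employee"):
        self.table = table
        self.described = False

    def act(self, observation: str) -> str:
        if not self.described:
            self.described = True
            return f"DESCRIBE {self.table}"
        return "ANSWER 0"
```

A trivial baseline like this is useful for sanity-checking the harness: it should score near 0% accuracy but a perfect parse rate.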
Results on our curated 10-database Spider subset (N=50, 2 runs):
| Method | Accuracy | Parse Rate | Avg Steps |
|---|---|---|---|
| Zero-shot | 0% | 24-28% | 10.8-12.4 |
| 1-shot | 0-2% | 16-17% | 14.0-14.8 |
| 3-shot | 0% | 19-20% | 13.8-14.8 |
| GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
| GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |
This evaluation is not comparable to the official Spider leaderboard, which uses different scoring, full-schema input, and a broader database set. See the blog post for detailed analysis.
## Data
676 questions (473 train, 203 eval) across 10 Spider databases with difficulty labels, plus 120 multi-turn SFT warmup trajectories generated from gold SQL. See docs/data-sources.md for full details on provenance, curation, and regeneration.
Data in data/ is adapted from Spider (Yu et al., 2018) and shared under CC BY-SA 4.0. See DATA_LICENSE.
## Project Structure

```
sqlenv/
├── __init__.py, client.py, models.py   # Core types and client
├── server/
│   ├── app.py                 # FastAPI server
│   ├── sql_environment.py     # Environment implementation
│   ├── reward.py              # Three-layer reward function
│   ├── verifier.py            # Answer verification
│   └── Dockerfile             # HF Spaces deployment
├── evaluation/                # Green Agent evaluator, policies
├── training/                  # TRL adapter, data loading
├── scripts/                   # Data curation, SFT generation
├── notebooks/                 # Training, evaluation, showcase
├── data/
│   ├── databases/             # 10 Spider SQLite databases
│   ├── questions/             # Train/eval question sets
│   └── sft/                   # SFT warmup trajectories
├── configs/                   # Training configurations
├── tests/                     # Unit and integration tests
└── docs/
    ├── data-sources.md        # Data provenance
    └── ARCHITECTURE.md        # System architecture
```
## References
- Yu et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. EMNLP.
- Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. (GRPO algorithm)
- Ng, Harada, Russell (1999). Policy Invariance Under Reward Transformations. ICML.
- OpenEnv framework
- TRL OpenEnv docs
## License
Code: MIT. Data: CC BY-SA 4.0.