---
title: SQLEnv
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: true
base_path: /web
---
# SQLEnv: Teaching Small Models to Explore Databases



SQLEnv is an RL environment for training small language models to answer questions about SQL databases through iterative exploration. Instead of producing one-shot SQL from a fully visible schema, the agent discovers the schema step by step using four tools: DESCRIBE, SAMPLE, QUERY, and ANSWER.

Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and trained with [TRL](https://huggingface.co/docs/trl)'s GRPO implementation. A 0.6B-parameter model trained in this environment goes from 0% to ~30% accuracy on a curated Spider subset, learning to explore schemas, recover from SQL errors, and format answers correctly.

**[Blog post](https://hjerpe-sqlenv-blog.static.hf.space)** | **[Live environment](https://huggingface.co/spaces/hjerpe/sql_env)** | **[Training notebook](notebooks/train_grpo.ipynb)**
## Quick Start
```bash
uv sync
uv run pytest tests/ -v
```
Run the environment locally:
```bash
uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```
Or with Docker:
```bash
docker build -t sqlenv:latest -f server/Dockerfile .
docker run -p 8000:8000 sqlenv:latest
```
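
Either way, a quick smoke test is to open the interactive API docs that FastAPI serves by default (assuming the default `/docs` route has not been disabled):
```bash
curl -s http://localhost:8000/docs | head -n 5   # should return the Swagger UI HTML
```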
## How It Works
Each episode starts with a natural-language question and a list of table names. The schema (columns, types, relationships) is hidden. The agent uses four actions to explore:

| Action | Purpose |
|--------|---------|
| `DESCRIBE table` | Reveal column names, types, and row count |
| `SAMPLE table` | Preview representative rows |
| `QUERY sql` | Execute read-only SQL |
| `ANSWER value` | Submit a final answer (ends episode) |
The environment provides dense reward at each step (operational feedback + progress toward the answer) and a terminal reward for correctness (+1.0 correct, 0.0 wrong). See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for details on the reward architecture.
```python
from transformers import AutoTokenizer
from server.sql_environment import SQLEnvironment, SQLAction

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")  # tokenizer of the model being trained
env = SQLEnvironment(
    questions_path="data/questions/questions_train.json",
    db_dir="data/databases",
    tokenizer=tok,
)
obs = env.reset(seed=42)                                                # new question, hidden schema
obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))  # reveal columns and row count
obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
obs = env.step(SQLAction(action_type="ANSWER", argument="10"))
# obs.done=True, obs.reward=1.0
```
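
The dense layer follows the spirit of potential-based reward shaping (Ng et al., 1999, cited below). A minimal sketch of how the per-step and terminal signals could combine; the actual three-layer implementation lives in `server/reward.py` and will differ in detail:
```python
# Illustrative only: potential-based shaping on top of a terminal correctness reward.
# step_feedback, phi_prev, phi_next stand in for quantities computed by server/reward.py.
def shaped_reward(step_feedback: float, phi_prev: float, phi_next: float,
                  correct: bool | None, gamma: float = 1.0) -> float:
    r = step_feedback + gamma * phi_next - phi_prev   # dense, per-step signal
    if correct is not None:                           # episode ended via ANSWER
        r += 1.0 if correct else 0.0                  # terminal reward
    return r
```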
## Training
We train [Qwen3-0.6B](https://arxiv.org/abs/2505.09388) using [GRPO](https://arxiv.org/abs/2402.03300) (from DeepSeekMath) through TRL's `environment_factory`. The full pipeline (SFT warmup + two-phase GRPO) runs in ~5 hours on a single Colab L4.
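
At rollout time the TRL adapter plays whole episodes against the environment and hands the resulting trajectories to GRPO for scoring. A minimal sketch of that loop, using the environment API shown above; `generate_action` is a hypothetical stand-in for the policy model plus action parsing:
```python
def collect_episode(env, generate_action, max_steps: int = 15):
    """Play one episode; return the transcript and total reward for GRPO scoring."""
    obs = env.reset()
    transcript, total_reward = [], 0.0
    for _ in range(max_steps):
        action = generate_action(obs)      # hypothetical: model text -> SQLAction
        obs = env.step(action)
        transcript.append((action, obs))
        total_reward += obs.reward
        if obs.done:                       # ANSWER submitted or episode truncated
            break
    return transcript, total_reward
```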
**Notebooks:**
- **[train_grpo.ipynb](notebooks/train_grpo.ipynb)** runs the full SFT + GRPO pipeline
- **[compare_methods.ipynb](notebooks/compare_methods.ipynb)** evaluates base vs trained models
- **[showcase_sqlenv.ipynb](notebooks/showcase_sqlenv.ipynb)** lets you explore the environment interactively
**Local test (CPU, ~3 min):**
```bash
docker build -f Dockerfile.test -t sqlenv-test .
docker run --rm sqlenv-test
```
## Evaluation
All evaluation runs through the Green Agent evaluator:
```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy
result = evaluate(env, policy, n_episodes=50, seed=0)
print(f"Accuracy: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
```
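
Any object exposing the same interface as `RandomPolicy` and `OraclePolicy` can be evaluated. A toy example (the `act(observation)` method name is an assumption; check `evaluation/` for the actual protocol):
```python
from server.sql_environment import SQLAction

class AlwaysAnswerPolicy:
    """Baseline that answers "0" without exploring; useful as a sanity check."""
    def act(self, observation) -> SQLAction:   # assumed policy interface
        return SQLAction(action_type="ANSWER", argument="0")

result = evaluate(env, AlwaysAnswerPolicy(), n_episodes=50, seed=0)
```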
Results on our curated 10-database Spider subset (N=50, 2 runs):

| Method | Accuracy | Parse Rate | Avg Steps |
|--------|----------|------------|-----------|
| Zero-shot | 0% | 24-28% | 10.8-12.4 |
| 1-shot | 0-2% | 16-17% | 14.0-14.8 |
| 3-shot | 0% | 19-20% | 13.8-14.8 |
| GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
| GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |
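
Parse rate reflects how often the model's raw output can be mapped onto one of the four actions at all. A minimal sketch of such a parser (the environment's own parser may accept additional formats):
```python
import re
from server.sql_environment import SQLAction

def parse_action(text: str) -> SQLAction | None:
    """Map raw output like 'QUERY SELECT ...' to an SQLAction; None if it does not parse."""
    m = re.match(r"\s*(DESCRIBE|SAMPLE|QUERY|ANSWER)\s+(.+)", text, re.DOTALL)
    return SQLAction(action_type=m.group(1), argument=m.group(2).strip()) if m else None
```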
This evaluation is not comparable to the official Spider leaderboard, which uses different scoring, full-schema input, and a broader database set. See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for detailed analysis.
## Data
676 questions (473 train, 203 eval) across 10 Spider databases with difficulty labels, plus 120 multi-turn SFT warmup trajectories generated from gold SQL. See [docs/data-sources.md](docs/data-sources.md) for full details on provenance, curation, and regeneration.
Data in `data/` is adapted from [Spider](https://yale-lily.github.io/spider) (Yu et al., 2018) and shared under CC BY-SA 4.0. See [DATA_LICENSE](DATA_LICENSE).
## Project Structure
```
sqlenv/
├── __init__.py, client.py, models.py   # Core types and client
├── server/
│   ├── app.py                # FastAPI server
│   ├── sql_environment.py    # Environment implementation
│   ├── reward.py             # Three-layer reward function
│   ├── verifier.py           # Answer verification
│   └── Dockerfile            # HF Spaces deployment
├── evaluation/               # Green Agent evaluator, policies
├── training/                 # TRL adapter, data loading
├── scripts/                  # Data curation, SFT generation
├── notebooks/                # Training, evaluation, showcase
├── data/
│   ├── databases/            # 10 Spider SQLite databases
│   ├── questions/            # Train/eval question sets
│   └── sft/                  # SFT warmup trajectories
├── configs/                  # Training configurations
├── tests/                    # Unit and integration tests
└── docs/
    ├── data-sources.md       # Data provenance
    └── ARCHITECTURE.md       # System architecture
```
## References
- Yu et al. (2018). [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task](https://yale-lily.github.io/spider). EMNLP.
- Shao et al. (2024). [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). (GRPO algorithm)
- Ng, Harada, Russell (1999). [Policy Invariance Under Reward Transformations](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf). ICML.
- [OpenEnv framework](https://github.com/meta-pytorch/OpenEnv)
- [TRL OpenEnv docs](https://huggingface.co/docs/trl/openenv)
## License
Code: [MIT](LICENSE). Data: [CC BY-SA 4.0](DATA_LICENSE).