---
title: SQLEnv
emoji: πŸ€–
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: true
base_path: /web
---

# SQLEnv: Teaching Small Models to Explore Databases

![Python](https://img.shields.io/badge/python-3.12-blue.svg)
![License](https://img.shields.io/badge/license-MIT-green.svg)
![Data](https://img.shields.io/badge/data-CC%20BY--SA%204.0-orange.svg)

SQLEnv is an RL environment for training small language models to answer questions about SQL databases through iterative exploration. Instead of producing one-shot SQL from a fully visible schema, the agent discovers the schema step by step using four tools: DESCRIBE, SAMPLE, QUERY, and ANSWER.

Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and trained with [TRL](https://huggingface.co/docs/trl)'s GRPO implementation. A 0.6B parameter model trained in this environment goes from 0% to ~30% accuracy on a curated Spider subset, learning to explore schemas, recover from SQL errors, and format answers correctly.

**[Blog post](https://hjerpe-sqlenv-blog.static.hf.space)** | **[Live environment](https://huggingface.co/spaces/hjerpe/sql_env)** | **[Training notebook](notebooks/train_grpo.ipynb)**

## Quick Start

```bash
uv sync
uv run pytest tests/ -v
```

Run the environment locally:

```bash
uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sqlenv:latest -f server/Dockerfile .
docker run -p 8000:8000 sqlenv:latest
```

## How It Works

Each episode starts with a natural-language question and a list of table names. The schema (columns, types, relationships) is hidden. The agent uses four actions to explore:

| Action | Purpose |
|--------|---------|
| `DESCRIBE table` | Reveal column names, types, and row count |
| `SAMPLE table` | Preview representative rows |
| `QUERY sql` | Execute read-only SQL |
| `ANSWER value` | Submit a final answer (ends episode) |
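
Free-form model output has to be mapped onto one of these four actions before it can be executed, which is what the Parse Rate metric below measures. A minimal sketch of such a parser (the regex and function name are illustrative; the project's actual parser may differ):

```python
import re

# Hypothetical parser sketch -- not the project's real implementation.
ACTION_RE = re.compile(r"^(DESCRIBE|SAMPLE|QUERY|ANSWER)\s+(.+)$",
                       re.MULTILINE | re.DOTALL)

def parse_action(text: str):
    """Return (action_type, argument) for the first recognised
    action line in the model's output, or None if nothing parses."""
    match = ACTION_RE.search(text.strip())
    if match is None:
        return None  # would count against the parse-rate metric
    return match.group(1), match.group(2).strip()
```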

The environment provides dense reward at each step (operational feedback + progress toward the answer) and a terminal reward for correctness (+1.0 correct, 0.0 wrong). See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for details on the reward architecture.
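
The safety of mixing dense per-step rewards with a terminal correctness reward rests on the potential-based shaping result of Ng, Harada, and Russell (1999), cited below. A self-contained sketch of the idea, with a made-up potential function for illustration (not the environment's actual one):

```python
def shaped_reward(base, phi_prev, phi_curr, gamma=1.0):
    """Potential-based shaping (Ng, Harada, Russell 1999): adding
    gamma*Phi(s') - Phi(s) to each step's reward provably leaves
    the optimal policy unchanged."""
    return base + gamma * phi_curr - phi_prev

def exploration_potential(described, relevant_tables):
    """Illustrative potential: fraction of the question's relevant
    tables the agent has DESCRIBEd so far."""
    return len(described & relevant_tables) / len(relevant_tables)
```

With `gamma = 1.0` the shaping terms telescope, so an episode's total shaping bonus depends only on the start and end states, never on the path taken.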

```python
from server.sql_environment import SQLEnvironment, SQLAction

env = SQLEnvironment(questions_path="data/questions/questions_train.json",
                     db_dir="data/databases", tokenizer=tok)  # tok: a loaded HF tokenizer
obs = env.reset(seed=42)
obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
obs = env.step(SQLAction(action_type="ANSWER", argument="10"))
# obs.done=True, obs.reward=1.0
```


## Training

We train [Qwen3-0.6B](https://arxiv.org/abs/2505.09388) using [GRPO](https://arxiv.org/abs/2402.03300) (from DeepSeekMath) through TRL's `environment_factory`. The full pipeline (SFT warmup + two-phase GRPO) runs in ~5 hours on a single Colab L4.
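
GRPO's central trick is cheap to sketch: sample a group of rollouts per question and score each against its own group's statistics, which removes the need for a learned value function. A self-contained illustration of the group statistic (not TRL's implementation):

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as in DeepSeekMath's GRPO:
    normalise each rollout's reward by the mean and std of the
    group of rollouts sampled for the same question."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    std = (sum((r - mean) ** 2 for r in group_rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

A degenerate group where every rollout earns the same reward yields all-zero advantages, so it contributes no gradient signal.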

**Notebooks:**
- **[train_grpo.ipynb](notebooks/train_grpo.ipynb)** runs the full SFT + GRPO pipeline
- **[compare_methods.ipynb](notebooks/compare_methods.ipynb)** evaluates base vs trained models
- **[showcase_sqlenv.ipynb](notebooks/showcase_sqlenv.ipynb)** lets you explore the environment interactively

**Local test (CPU, ~3 min):**

```bash
docker build -f Dockerfile.test -t sqlenv-test .
docker run --rm sqlenv-test
```

## Evaluation

All evaluation runs through the Green Agent evaluator:

```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

result = evaluate(env, policy, n_episodes=50, seed=0)  # policy: e.g. a RandomPolicy instance
print(f"Accuracy: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
```
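
A policy only needs to map an observation to an action. A toy hand-rolled baseline in the spirit of `RandomPolicy` (the packaged class's constructor and method names may differ):

```python
import random

class ScriptedRandomPolicy:
    """Illustrative baseline, not the packaged RandomPolicy:
    picks a random action type each step."""
    ACTION_TYPES = ["DESCRIBE", "SAMPLE", "QUERY", "ANSWER"]

    def __init__(self, tables, seed=0):
        self.tables = list(tables)
        self.rng = random.Random(seed)

    def act(self, observation):
        action_type = self.rng.choice(self.ACTION_TYPES)
        if action_type == "QUERY":
            table = self.rng.choice(self.tables)
            return action_type, f"SELECT COUNT(*) FROM {table}"
        if action_type == "ANSWER":
            return action_type, "0"  # blind guess ends the episode
        return action_type, self.rng.choice(self.tables)
```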

Results on our curated 10-database Spider subset (N=50, 2 runs):

| Method | Accuracy | Parse Rate | Avg Steps |
|--------|----------|------------|-----------|
| Zero-shot | 0% | 24-28% | 10.8-12.4 |
| 1-shot | 0-2% | 16-17% | 14.0-14.8 |
| 3-shot | 0% | 19-20% | 13.8-14.8 |
| GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
| GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |

This evaluation is not comparable to the official Spider leaderboard, which uses different scoring, full-schema input, and a broader database set. See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for detailed analysis.

## Data

676 questions (473 train, 203 eval) across 10 Spider databases with difficulty labels, plus 120 multi-turn SFT warmup trajectories generated from gold SQL. See [docs/data-sources.md](docs/data-sources.md) for full details on provenance, curation, and regeneration.

Data in `data/` is adapted from [Spider](https://yale-lily.github.io/spider) (Yu et al., 2018) and shared under CC BY-SA 4.0. See [DATA_LICENSE](DATA_LICENSE).

## Project Structure

```
sqlenv/
β”œβ”€β”€ __init__.py, client.py, models.py    # Core types and client
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py                           # FastAPI server
β”‚   β”œβ”€β”€ sql_environment.py               # Environment implementation
β”‚   β”œβ”€β”€ reward.py                        # Three-layer reward function
β”‚   β”œβ”€β”€ verifier.py                      # Answer verification
β”‚   └── Dockerfile                       # HF Spaces deployment
β”œβ”€β”€ evaluation/                          # Green Agent evaluator, policies
β”œβ”€β”€ training/                            # TRL adapter, data loading
β”œβ”€β”€ scripts/                             # Data curation, SFT generation
β”œβ”€β”€ notebooks/                           # Training, evaluation, showcase
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ databases/                       # 10 Spider SQLite databases
β”‚   β”œβ”€β”€ questions/                       # Train/eval question sets
β”‚   └── sft/                             # SFT warmup trajectories
β”œβ”€β”€ configs/                             # Training configurations
β”œβ”€β”€ tests/                               # Unit and integration tests
└── docs/
    β”œβ”€β”€ data-sources.md                  # Data provenance
    └── ARCHITECTURE.md                  # System architecture
```

## References

- Yu et al. (2018). [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task](https://yale-lily.github.io/spider). EMNLP.
- Shao et al. (2024). [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). (GRPO algorithm)
- Ng, Harada, Russell (1999). [Policy Invariance Under Reward Transformations](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf). ICML.
- [OpenEnv framework](https://github.com/meta-pytorch/OpenEnv)
- [TRL OpenEnv docs](https://huggingface.co/docs/trl/openenv)

## License

Code: [MIT](LICENSE). Data: [CC BY-SA 4.0](DATA_LICENSE).