---
title: SQLEnv
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: true
base_path: /web
---

# SQLEnv: Teaching Small Models to Explore Databases

SQLEnv is an RL environment for training small language models to answer questions about SQL databases through iterative exploration. Instead of producing one-shot SQL from a fully visible schema, the agent discovers the schema step by step using four tools: DESCRIBE, SAMPLE, QUERY, and ANSWER.

Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and trained with [TRL](https://huggingface.co/docs/trl)'s GRPO implementation. A 0.6B-parameter model trained in this environment goes from 0% to ~30% accuracy on a curated Spider subset, learning to explore schemas, recover from SQL errors, and format answers correctly.

**[Blog post](https://hjerpe-sqlenv-blog.static.hf.space)** | **[Live environment](https://huggingface.co/spaces/hjerpe/sql_env)** | **[Training notebook](notebooks/train_grpo.ipynb)**
|
|
## Quick Start

Install dependencies and run the test suite:

```bash
uv sync
uv run pytest tests/ -v
```

Run the environment locally:

```bash
uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sqlenv:latest -f server/Dockerfile .
docker run -p 8000:8000 sqlenv:latest
```
|
|
## How It Works

Each episode starts with a natural-language question and a list of table names. The schema (columns, types, relationships) is hidden. The agent uses four actions to explore:

| Action | Purpose |
|--------|---------|
| `DESCRIBE table` | Reveal column names, types, and row count |
| `SAMPLE table` | Preview representative rows |
| `QUERY sql` | Execute read-only SQL |
| `ANSWER value` | Submit a final answer (ends episode) |
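
As an illustration of this four-verb protocol, a minimal parser for agent output might look like the sketch below. The function name and the lenient whitespace handling are assumptions for illustration, not the project's actual parsing code:

```python
import re

def parse_action(text: str):
    """Split a model utterance into (action_type, argument).

    Returns None when the text does not begin with one of the four
    verbs -- a parse failure the environment can penalize.
    """
    m = re.match(r"\s*(DESCRIBE|SAMPLE|QUERY|ANSWER)\b\s*(.*)", text, re.DOTALL)
    if m is None:
        return None
    return m.group(1), m.group(2).strip()
```

For example, `parse_action("QUERY SELECT COUNT(*) FROM employee")` yields `("QUERY", "SELECT COUNT(*) FROM employee")`, while text that starts with anything else parses to `None`.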
|
|
The environment provides a dense reward at each step (operational feedback plus progress toward the answer) and a terminal reward for correctness (+1.0 correct, 0.0 wrong). See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for details on the reward architecture.
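
The concrete reward logic lives in `server/reward.py`; purely as an illustrative sketch of how dense and terminal signals can combine, something like the following (every constant and name below is invented for illustration, not the project's values):

```python
def shaped_reward(step_ok: bool, progress_delta: float, done: bool, correct: bool) -> float:
    """Illustrative combination of dense and terminal reward.

    step_ok: did the action parse and execute without error?
    progress_delta: change in a heuristic estimate of progress toward the answer.
    """
    reward = 0.0
    reward += 0.05 if step_ok else -0.05   # operational feedback
    reward += 0.1 * progress_delta          # progress shaping
    if done:
        reward += 1.0 if correct else 0.0   # terminal correctness
    return reward
```

The dense terms keep gradients flowing on episodes that never reach a correct answer, which matters for a 0.6B model starting at 0% accuracy.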
|
|
```python
from server.sql_environment import SQLEnvironment, SQLAction

# `tok` is a tokenizer instance (e.g. loaded with transformers' AutoTokenizer)
env = SQLEnvironment(
    questions_path="data/questions/questions_train.json",
    db_dir="data/databases",
    tokenizer=tok,
)
obs = env.reset(seed=42)
obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
obs = env.step(SQLAction(action_type="ANSWER", argument="10"))
# obs.done=True, obs.reward=1.0
```
|
|
## Training

We train [Qwen3-0.6B](https://arxiv.org/abs/2505.09388) using [GRPO](https://arxiv.org/abs/2402.03300) (from DeepSeekMath) through TRL's `environment_factory`. The full pipeline (SFT warmup + two-phase GRPO) runs in ~5 hours on a single Colab L4 GPU.
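
GRPO's key idea is group-relative advantage estimation: each prompt is sampled several times, and each completion's reward is normalized against its own group rather than a learned value function. As a toy sketch (an illustration, not TRL's implementation):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each completion's reward against its sampling group:
    advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary terminal rewards like SQLEnv's, this pushes probability mass toward the completions in a group that answered correctly and away from those that did not.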
|
|
**Notebooks:**
- **[train_grpo.ipynb](notebooks/train_grpo.ipynb)** runs the full SFT + GRPO pipeline
- **[compare_methods.ipynb](notebooks/compare_methods.ipynb)** evaluates base vs. trained models
- **[showcase_sqlenv.ipynb](notebooks/showcase_sqlenv.ipynb)** lets you explore the environment interactively

**Local test (CPU, ~3 min):**

```bash
docker build -f Dockerfile.test -t sqlenv-test .
docker run --rm sqlenv-test
```
|
|
## Evaluation

All evaluation runs through the Green Agent evaluator:

```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

# `policy` can be e.g. RandomPolicy(), OraclePolicy(), or a trained-model policy
result = evaluate(env, policy, n_episodes=50, seed=0)
print(f"Accuracy: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
```
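
Answer checking itself is handled by `server/verifier.py`; as a rough sketch of the kind of normalization such a verifier typically needs (the rules and function name below are assumptions, not the project's code):

```python
def answers_match(predicted: str, gold: str) -> bool:
    """Compare answers after light normalization: numeric values are
    compared as floats, everything else case-insensitively as strings."""
    p, g = predicted.strip(), gold.strip()
    try:
        return abs(float(p) - float(g)) < 1e-9
    except ValueError:
        return p.lower() == g.lower()
```

Without this kind of normalization, a model answering `10.0` against a gold `10` would be scored wrong for a formatting difference rather than a reasoning error.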
|
|
Results on our curated 10-database Spider subset (N=50, 2 runs):

| Method | Accuracy | Parse Rate | Avg Steps |
|--------|----------|------------|-----------|
| Zero-shot | 0% | 24-28% | 10.8-12.4 |
| 1-shot | 0-2% | 16-17% | 14.0-14.8 |
| 3-shot | 0% | 19-20% | 13.8-14.8 |
| GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
| GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |

This evaluation is not comparable to the official Spider leaderboard, which uses different scoring, full-schema input, and a broader database set. See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for detailed analysis.
|
|
## Data

676 questions (473 train, 203 eval) across 10 Spider databases with difficulty labels, plus 120 multi-turn SFT warmup trajectories generated from gold SQL. See [docs/data-sources.md](docs/data-sources.md) for full details on provenance, curation, and regeneration.
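
The actual generation script lives in `scripts/`; one plausible recipe (hypothetical code, including the function name and the naive table extraction) turns a gold SQL query into an explore-then-answer trajectory:

```python
import re

def gold_to_trajectory(gold_sql: str, gold_answer: str) -> list[str]:
    """Build a minimal multi-turn trajectory from supervised data:
    describe each referenced table, run the gold SQL, then answer."""
    # Naive table-name extraction after FROM/JOIN (assumes simple SQL)
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", gold_sql, re.IGNORECASE)
    steps = [f"DESCRIBE {t}" for t in dict.fromkeys(tables)]  # dedupe, keep order
    steps.append(f"QUERY {gold_sql}")
    steps.append(f"ANSWER {gold_answer}")
    return steps
```

Warmup data in this shape teaches the model the action format and the explore-first habit before RL takes over.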
|
|
Data in `data/` is adapted from [Spider](https://yale-lily.github.io/spider) (Yu et al., 2018) and shared under CC BY-SA 4.0. See [DATA_LICENSE](DATA_LICENSE).
|
|
## Project Structure

```
sqlenv/
├── __init__.py, client.py, models.py   # Core types and client
├── server/
│   ├── app.py                  # FastAPI server
│   ├── sql_environment.py      # Environment implementation
│   ├── reward.py               # Three-layer reward function
│   ├── verifier.py             # Answer verification
│   └── Dockerfile              # HF Spaces deployment
├── evaluation/                 # Green Agent evaluator, policies
├── training/                   # TRL adapter, data loading
├── scripts/                    # Data curation, SFT generation
├── notebooks/                  # Training, evaluation, showcase
├── data/
│   ├── databases/              # 10 Spider SQLite databases
│   ├── questions/              # Train/eval question sets
│   └── sft/                    # SFT warmup trajectories
├── configs/                    # Training configurations
├── tests/                      # Unit and integration tests
└── docs/
    ├── data-sources.md         # Data provenance
    └── ARCHITECTURE.md         # System architecture
```
|
|
## References

- Yu et al. (2018). [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task](https://yale-lily.github.io/spider). EMNLP.
- Shao et al. (2024). [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). (GRPO algorithm)
- Ng et al. (1999). [Policy Invariance Under Reward Transformations](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf). ICML.
- [OpenEnv framework](https://github.com/meta-pytorch/OpenEnv)
- [TRL OpenEnv docs](https://huggingface.co/docs/trl/openenv)

## License

Code: [MIT](LICENSE). Data: [CC BY-SA 4.0](DATA_LICENSE).
|
|