---
title: SQLEnv
emoji: 🤗
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: true
base_path: /web
---
# SQLEnv: Teaching Small Models to Explore Databases
SQLEnv is an RL environment for training small language models to answer questions about SQL databases through iterative exploration. Instead of producing one-shot SQL from a fully visible schema, the agent discovers the schema step by step using four tools: DESCRIBE, SAMPLE, QUERY, and ANSWER.
Built on OpenEnv and trained with TRL's GRPO implementation. A 0.6B parameter model trained in this environment goes from 0% to ~30% accuracy on a curated Spider subset, learning to explore schemas, recover from SQL errors, and format answers correctly.
Blog post | Live environment | Training notebook
## Quick Start

```bash
uv sync
uv run pytest tests/ -v
```
Run the environment locally:

```bash
uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sqlenv:latest -f server/Dockerfile .
docker run -p 8000:8000 sqlenv:latest
```
## How It Works
Each episode starts with a natural-language question and a list of table names. The schema (columns, types, relationships) is hidden. The agent uses four actions to explore:
| Action | Purpose |
|---|---|
| `DESCRIBE table` | Reveal column names, types, and row count |
| `SAMPLE table` | Preview representative rows |
| `QUERY sql` | Execute read-only SQL |
| `ANSWER value` | Submit a final answer (ends episode) |
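The agent emits these actions as plain text, so malformed output matters (this is what the "parse rate" in the results below measures). A minimal parser might look like the following sketch; the `SQLAction` mirror and the parsing rules here are illustrative assumptions, not the repo's actual implementation in `sql_environment.py`:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical mirror of the environment's SQLAction type.
@dataclass
class SQLAction:
    action_type: str  # DESCRIBE | SAMPLE | QUERY | ANSWER
    argument: str

VALID_ACTIONS = {"DESCRIBE", "SAMPLE", "QUERY", "ANSWER"}

def parse_action(text: str) -> Optional[SQLAction]:
    """Split the first word off as the action verb; the rest is the argument."""
    match = re.match(r"\s*(\w+)\s+(.+)", text, re.DOTALL)
    if not match:
        return None
    verb, arg = match.group(1).upper(), match.group(2).strip()
    if verb not in VALID_ACTIONS:
        return None  # unparseable output counts against the parse rate
    return SQLAction(action_type=verb, argument=arg)
```

Anything that fails to parse yields no tool call, which is why untrained small models spend most of their step budget producing invalid actions.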
The environment provides a dense reward at each step (operational feedback plus progress toward the answer) and a terminal reward for correctness (+1.0 if correct, 0.0 otherwise). See the blog post for details on the reward architecture.
```python
from server.sql_environment import SQLEnvironment, SQLAction

env = SQLEnvironment(
    questions_path="data/questions/questions_train.json",
    db_dir="data/databases",
    tokenizer=tok,
)

obs = env.reset(seed=42)
obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
obs = env.step(SQLAction(action_type="ANSWER", argument="10"))
# obs.done == True, obs.reward == 1.0
```
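To make the layered-reward idea concrete, here is a sketch of how such a function might compose: the weights, the progress signal, and the layer boundaries below are made-up assumptions for illustration, not the values in `reward.py`:

```python
def step_reward(parsed_ok: bool, sql_error: bool, progress: float,
                done: bool, correct: bool) -> float:
    """Illustrative layered reward: operational feedback, progress shaping,
    and a terminal correctness bonus. All weights are assumed, not the repo's."""
    r = 0.0
    r += 0.05 if parsed_ok else -0.05   # layer 1: was the action well-formed?
    if sql_error:
        r -= 0.05                       # penalize SQL that fails to execute
    r += 0.1 * progress                 # layer 2: shaped progress toward the answer
    if done:
        r += 1.0 if correct else 0.0    # layer 3: terminal correctness
    return r
```

The key property is that the dense layers are small relative to the terminal reward, so shaping guides exploration without overwhelming the correctness signal.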
## Training

We train Qwen3-0.6B using GRPO (from DeepSeekMath) through TRL's `environment_factory`. The full pipeline (SFT warmup + two-phase GRPO) runs in ~5 hours on a single Colab L4.
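The core of GRPO is a group-relative advantage: several completions are sampled per prompt, and each completion's reward is normalized against its siblings rather than against a learned value function. A minimal standalone sketch of that normalization (TRL computes this internally; this is not its actual code):

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """GRPO-style advantage: z-score each reward within its sampled group,
    pushing the policy toward completions that beat their siblings."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With sparse 0/1 correctness rewards this only produces a learning signal once at least one sampled completion in a group succeeds, which is why the SFT warmup matters for a 0.6B model starting at 0% accuracy.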
Notebooks:
- `train_grpo.ipynb` runs the full SFT + GRPO pipeline
- `compare_methods.ipynb` evaluates base vs trained models
- `showcase_sqlenv.ipynb` lets you explore the environment interactively
Local test (CPU, ~3 min):

```bash
docker build -f Dockerfile.test -t sqlenv-test .
docker run --rm sqlenv-test
```
## Evaluation

All evaluation runs through the Green Agent evaluator:

```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

result = evaluate(env, policy, n_episodes=50, seed=0)
print(f"Accuracy: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
```
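Plugging a custom policy into `evaluate` presumably just means implementing an act-on-observation interface alongside `RandomPolicy` and `OraclePolicy`; the class below is a purely hypothetical sketch (the actual `Policy` protocol in `sql_env.evaluation` may differ):

```python
class FixedAnswerPolicy:
    """Toy baseline: describe one table, then always answer "0".
    Illustrative only; assumes a policy is a stateful act(observation) -> str."""
    def __init__(self, table: str = "employee"):
        self.table = table
        self.described = False

    def act(self, observation: str) -> str:
        if not self.described:
            self.described = True
            return f"DESCRIBE {self.table}"
        return "ANSWER 0"
```

A trivial baseline like this is useful for sanity-checking the harness: it should score near 0% accuracy but a perfect parse rate.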
Results on our curated 10-database Spider subset (N=50, 2 runs):
| Method | Accuracy | Parse Rate | Avg Steps |
|---|---|---|---|
| Zero-shot | 0% | 24-28% | 10.8-12.4 |
| 1-shot | 0-2% | 16-17% | 14.0-14.8 |
| 3-shot | 0% | 19-20% | 13.8-14.8 |
| GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
| GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |
This evaluation is not comparable to the official Spider leaderboard, which uses different scoring, full-schema input, and a broader database set. See the blog post for detailed analysis.
## Data
676 questions (473 train, 203 eval) across 10 Spider databases with difficulty labels, plus 120 multi-turn SFT warmup trajectories generated from gold SQL. See docs/data-sources.md for full details on provenance, curation, and regeneration.
Data in data/ is adapted from Spider (Yu et al., 2018) and shared under CC BY-SA 4.0. See DATA_LICENSE.
## Project Structure

```
sqlenv/
├── __init__.py, client.py, models.py   # Core types and client
├── server/
│   ├── app.py                 # FastAPI server
│   ├── sql_environment.py     # Environment implementation
│   ├── reward.py              # Three-layer reward function
│   ├── verifier.py            # Answer verification
│   └── Dockerfile             # HF Spaces deployment
├── evaluation/                # Green Agent evaluator, policies
├── training/                  # TRL adapter, data loading
├── scripts/                   # Data curation, SFT generation
├── notebooks/                 # Training, evaluation, showcase
├── data/
│   ├── databases/             # 10 Spider SQLite databases
│   ├── questions/             # Train/eval question sets
│   └── sft/                   # SFT warmup trajectories
├── configs/                   # Training configurations
├── tests/                     # Unit and integration tests
└── docs/
    ├── data-sources.md        # Data provenance
    └── ARCHITECTURE.md        # System architecture
```
## References
- Yu et al. (2018). Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. EMNLP.
- Shao et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. (GRPO algorithm)
- Ng, Harada, Russell (1999). Policy Invariance Under Reward Transformations. ICML.
- OpenEnv framework
- TRL OpenEnv docs
## License
Code: MIT. Data: CC BY-SA 4.0.