---
title: SQLEnv
emoji: πŸ€–
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: true
base_path: /web
---

# SQLEnv: Teaching Small Models to Explore Databases


SQLEnv is an RL environment for training small language models to answer questions about SQL databases through iterative exploration. Instead of producing one-shot SQL from a fully visible schema, the agent discovers the schema step by step using four tools: DESCRIBE, SAMPLE, QUERY, and ANSWER.

Built on OpenEnv and trained with TRL's GRPO implementation. A 0.6B parameter model trained in this environment goes from 0% to ~30% accuracy on a curated Spider subset, learning to explore schemas, recover from SQL errors, and format answers correctly.

Blog post | Live environment | Training notebook

## Quick Start

```bash
uv sync
uv run pytest tests/ -v
```

Run the environment locally:

```bash
uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sqlenv:latest -f server/Dockerfile .
docker run -p 8000:8000 sqlenv:latest
```

## How It Works

Each episode starts with a natural-language question and a list of table names. The schema (columns, types, relationships) is hidden. The agent uses four actions to explore:

| Action | Purpose |
|---|---|
| `DESCRIBE table` | Reveal column names, types, and row count |
| `SAMPLE table` | Preview representative rows |
| `QUERY sql` | Execute read-only SQL |
| `ANSWER value` | Submit a final answer (ends the episode) |
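Model output has to be mapped onto these four actions before the environment can execute it. A minimal parser along those lines might look like this (a sketch; the environment's actual parsing rules may differ):

```python
import re

# Matches one of the four SQLEnv actions followed by its argument.
# Hypothetical regex-based parser, not the environment's own implementation.
ACTION_RE = re.compile(
    r"^(DESCRIBE|SAMPLE|QUERY|ANSWER)\s+(.+)$", re.IGNORECASE | re.DOTALL
)

def parse_action(text: str):
    """Return (action_type, argument), or None if the text does not parse."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        return None
    return match.group(1).upper(), match.group(2).strip()
```

Anything that fails to parse simply yields no action, which is one reason the parse-rate metric reported below matters.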

The environment provides dense reward at each step (operational feedback + progress toward the answer) and a terminal reward for correctness (+1.0 correct, 0.0 wrong). See the blog post for details on the reward architecture.
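As an illustration only, a layered reward of this shape could be sketched as follows. The weights and function names here are hypothetical; the actual three-layer function lives in `server/reward.py`:

```python
def step_reward(parsed_ok: bool, sql_error: bool, progress: float) -> float:
    """Sketch of a dense per-step reward: operational feedback plus a shaped
    progress term. Weights are illustrative, not the values used in SQLEnv."""
    reward = 0.05 if parsed_ok else -0.05  # did the action parse at all?
    if sql_error:
        reward -= 0.10                     # penalize SQL that errors out
    reward += 0.20 * progress              # shaped progress signal in [0, 1]
    return reward

def terminal_reward(answer_correct: bool) -> float:
    """Terminal reward as stated above: +1.0 correct, 0.0 wrong."""
    return 1.0 if answer_correct else 0.0
```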

```python
from server.sql_environment import SQLEnvironment, SQLAction

env = SQLEnvironment(
    questions_path="data/questions/questions_train.json",
    db_dir="data/databases",
    tokenizer=tok,  # a preloaded tokenizer
)
obs = env.reset(seed=42)
obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
obs = env.step(SQLAction(action_type="ANSWER", argument="10"))
# obs.done == True, obs.reward == 1.0
```

## Training

We train Qwen3-0.6B with GRPO (the algorithm introduced in DeepSeekMath) through TRL's `environment_factory`. The full pipeline (SFT warmup followed by two-phase GRPO) runs in ~5 hours on a single Colab L4.
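GRPO's key idea is to replace a learned value baseline with a group-relative one: several rollouts are sampled per prompt, and each rollout's advantage is its reward normalized against the group. A minimal sketch of that computation (the exact normalization in TRL's implementation may differ in details):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std.
    This is the group-relative baseline at the heart of GRPO."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With the sparse-ish rewards above, rollouts that answer correctly get positive advantages within their group and failed rollouts get negative ones, without training a critic.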

Notebooks:

Local test (CPU, ~3 min):

```bash
docker build -f Dockerfile.test -t sqlenv-test .
docker run --rm sqlenv-test
```

## Evaluation

All evaluation runs through the Green Agent evaluator:

```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

# policy: e.g. an instance of RandomPolicy or OraclePolicy, or a trained model
result = evaluate(env, policy, n_episodes=50, seed=0)
print(f"Accuracy: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
```

Results on our curated 10-database Spider subset (N=50, 2 runs):

| Method | Accuracy | Parse Rate | Avg Steps |
|---|---|---|---|
| Zero-shot | 0% | 24-28% | 10.8-12.4 |
| 1-shot | 0-2% | 16-17% | 14.0-14.8 |
| 3-shot | 0% | 19-20% | 13.8-14.8 |
| GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
| GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |
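The columns above can be reproduced from per-episode evaluation records. A sketch with assumed field names (the real `evaluate` result object may be structured differently):

```python
def summarize(episodes: list[dict]) -> dict:
    """Aggregate per-episode records into accuracy, parse rate, and avg steps.
    Field names ('correct', 'parsed_actions', 'total_actions', 'steps') are
    illustrative, not the evaluator's actual schema."""
    n = len(episodes)
    return {
        "accuracy": sum(e["correct"] for e in episodes) / n,
        "parse_rate": sum(e["parsed_actions"] for e in episodes)
                      / sum(e["total_actions"] for e in episodes),
        "avg_steps": sum(e["steps"] for e in episodes) / n,
    }
```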

This evaluation is not comparable to the official Spider leaderboard, which uses different scoring, full-schema input, and a broader database set. See the blog post for detailed analysis.

## Data

676 questions (473 train, 203 eval) across 10 Spider databases with difficulty labels, plus 120 multi-turn SFT warmup trajectories generated from gold SQL. See docs/data-sources.md for full details on provenance, curation, and regeneration.

Data in data/ is adapted from Spider (Yu et al., 2018) and shared under CC BY-SA 4.0. See DATA_LICENSE.

## Project Structure

```
sqlenv/
├── __init__.py, client.py, models.py    # Core types and client
├── server/
│   ├── app.py                           # FastAPI server
│   ├── sql_environment.py               # Environment implementation
│   ├── reward.py                        # Three-layer reward function
│   ├── verifier.py                      # Answer verification
│   └── Dockerfile                       # HF Spaces deployment
├── evaluation/                          # Green Agent evaluator, policies
├── training/                            # TRL adapter, data loading
├── scripts/                             # Data curation, SFT generation
├── notebooks/                           # Training, evaluation, showcase
├── data/
│   ├── databases/                       # 10 Spider SQLite databases
│   ├── questions/                       # Train/eval question sets
│   └── sft/                             # SFT warmup trajectories
├── configs/                             # Training configurations
├── tests/                               # Unit and integration tests
└── docs/
    ├── data-sources.md                  # Data provenance
    └── ARCHITECTURE.md                  # System architecture
```

## References

- Yu et al., 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. EMNLP 2018.
- Shao et al., 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.

## License

Code: MIT. Data: CC BY-SA 4.0.