---
title: SQLEnv
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: true
base_path: /web
---

# SQLEnv: Teaching Small Models to Explore Databases

SQLEnv is an RL environment for training small language models to answer questions about SQL databases through iterative exploration. Instead of producing one-shot SQL from a fully visible schema, the agent discovers the schema step by step using four tools: DESCRIBE, SAMPLE, QUERY, and ANSWER.

Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) and trained with [TRL](https://huggingface.co/docs/trl)'s GRPO implementation. A 0.6B-parameter model trained in this environment goes from 0% to ~30% accuracy on a curated Spider subset, learning to explore schemas, recover from SQL errors, and format answers correctly.

**[Blog post](https://hjerpe-sqlenv-blog.static.hf.space)** | **[Live environment](https://huggingface.co/spaces/hjerpe/sql_env)** | **[Training notebook](notebooks/train_grpo.ipynb)**
|
|
## Quick Start

Install dependencies and run the test suite:

```bash
uv sync
uv run pytest tests/ -v
```

Run the environment locally:

```bash
uv run uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

Or with Docker:

```bash
docker build -t sqlenv:latest -f server/Dockerfile .
docker run -p 8000:8000 sqlenv:latest
```
|
|
## How It Works

Each episode starts with a natural-language question and a list of table names. The schema (columns, types, relationships) is hidden. The agent uses four actions to explore:

| Action | Purpose |
|--------|---------|
| `DESCRIBE table` | Reveal column names, types, and row count |
| `SAMPLE table` | Preview representative rows |
| `QUERY sql` | Execute read-only SQL |
| `ANSWER value` | Submit a final answer (ends episode) |
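
As an illustration of this four-verb protocol, a minimal parser for agent output might look like the sketch below. The function name and the lenient whitespace handling are assumptions for illustration, not the project's actual parsing code:

```python
import re

def parse_action(text: str):
    """Split a model utterance into (action_type, argument).

    Returns None when the text does not begin with one of the four
    verbs -- a parse failure the environment can penalize.
    """
    m = re.match(r"\s*(DESCRIBE|SAMPLE|QUERY|ANSWER)\b\s*(.*)", text, re.DOTALL)
    if m is None:
        return None
    return m.group(1), m.group(2).strip()
```

For example, `parse_action("QUERY SELECT COUNT(*) FROM employee")` yields `("QUERY", "SELECT COUNT(*) FROM employee")`, while text that starts with anything else parses to `None`.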
|
|
The environment provides a dense reward at each step (operational feedback plus progress toward the answer) and a terminal reward for correctness (+1.0 correct, 0.0 wrong). See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for details on the reward architecture.
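
The concrete reward logic lives in `server/reward.py`; purely as an illustrative sketch of how dense and terminal signals can combine, something like the following (every constant and name below is invented for illustration, not the project's values):

```python
def shaped_reward(step_ok: bool, progress_delta: float, done: bool, correct: bool) -> float:
    """Illustrative combination of dense and terminal reward.

    step_ok: did the action parse and execute without error?
    progress_delta: change in a heuristic estimate of progress toward the answer.
    """
    reward = 0.0
    reward += 0.05 if step_ok else -0.05   # operational feedback
    reward += 0.1 * progress_delta          # progress shaping
    if done:
        reward += 1.0 if correct else 0.0   # terminal correctness
    return reward
```

The dense terms keep gradients flowing on episodes that never reach a correct answer, which matters for a 0.6B model starting at 0% accuracy.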
|
|
```python
from server.sql_environment import SQLEnvironment, SQLAction

# `tok` is a tokenizer instance (e.g. loaded with transformers' AutoTokenizer)
env = SQLEnvironment(
    questions_path="data/questions/questions_train.json",
    db_dir="data/databases",
    tokenizer=tok,
)
obs = env.reset(seed=42)
obs = env.step(SQLAction(action_type="DESCRIBE", argument="employee"))
obs = env.step(SQLAction(action_type="QUERY", argument="SELECT COUNT(*) FROM employee"))
obs = env.step(SQLAction(action_type="ANSWER", argument="10"))
# obs.done=True, obs.reward=1.0
```
|
|
## Training

We train [Qwen3-0.6B](https://arxiv.org/abs/2505.09388) using [GRPO](https://arxiv.org/abs/2402.03300) (from DeepSeekMath) through TRL's `environment_factory`. The full pipeline (SFT warmup + two-phase GRPO) runs in ~5 hours on a single Colab L4 GPU.
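
GRPO's key idea is group-relative advantage estimation: each prompt is sampled several times, and each completion's reward is normalized against its own group rather than a learned value function. As a toy sketch (an illustration, not TRL's implementation):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each completion's reward against its sampling group:
    advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary terminal rewards like SQLEnv's, this pushes probability mass toward the completions in a group that answered correctly and away from those that did not.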
|
|
**Notebooks:**
- **[train_grpo.ipynb](notebooks/train_grpo.ipynb)** runs the full SFT + GRPO pipeline
- **[compare_methods.ipynb](notebooks/compare_methods.ipynb)** evaluates base vs. trained models
- **[showcase_sqlenv.ipynb](notebooks/showcase_sqlenv.ipynb)** lets you explore the environment interactively

**Local test (CPU, ~3 min):**

```bash
docker build -f Dockerfile.test -t sqlenv-test .
docker run --rm sqlenv-test
```
|
|
## Evaluation

All evaluation runs through the Green Agent evaluator:

```python
from sql_env.evaluation import evaluate, RandomPolicy, OraclePolicy

# `policy` can be e.g. RandomPolicy(), OraclePolicy(), or a trained-model policy
result = evaluate(env, policy, n_episodes=50, seed=0)
print(f"Accuracy: {result.success_rate:.1%}, Reward: {result.avg_reward:.3f}")
```
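
Answer checking itself is handled by `server/verifier.py`; as a rough sketch of the kind of normalization such a verifier typically needs (the rules and function name below are assumptions, not the project's code):

```python
def answers_match(predicted: str, gold: str) -> bool:
    """Compare answers after light normalization: numeric values are
    compared as floats, everything else case-insensitively as strings."""
    p, g = predicted.strip(), gold.strip()
    try:
        return abs(float(p) - float(g)) < 1e-9
    except ValueError:
        return p.lower() == g.lower()
```

Without this kind of normalization, a model answering `10.0` against a gold `10` would be scored wrong for a formatting difference rather than a reasoning error.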
|
|
Results on our curated 10-database Spider subset (N=50, 2 runs):

| Method | Accuracy | Parse Rate | Avg Steps |
|--------|----------|------------|-----------|
| Zero-shot | 0% | 24-28% | 10.8-12.4 |
| 1-shot | 0-2% | 16-17% | 14.0-14.8 |
| 3-shot | 0% | 19-20% | 13.8-14.8 |
| GRPO v1 (2 epochs) | 28-30% | 95-100% | 3.5-4.0 |
| GRPO v2 (4 epochs) | 24-32% | 87-95% | 3.5-4.0 |

This evaluation is not comparable to the official Spider leaderboard, which uses different scoring, full-schema input, and a broader database set. See the [blog post](https://hjerpe-sqlenv-blog.static.hf.space) for detailed analysis.
|
|
## Data

676 questions (473 train, 203 eval) across 10 Spider databases with difficulty labels, plus 120 multi-turn SFT warmup trajectories generated from gold SQL. See [docs/data-sources.md](docs/data-sources.md) for full details on provenance, curation, and regeneration.
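
The actual generation script lives in `scripts/`; one plausible recipe (hypothetical code, including the function name and the naive table extraction) turns a gold SQL query into an explore-then-answer trajectory:

```python
import re

def gold_to_trajectory(gold_sql: str, gold_answer: str) -> list[str]:
    """Build a minimal multi-turn trajectory from supervised data:
    describe each referenced table, run the gold SQL, then answer."""
    # Naive table-name extraction after FROM/JOIN (assumes simple SQL)
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)", gold_sql, re.IGNORECASE)
    steps = [f"DESCRIBE {t}" for t in dict.fromkeys(tables)]  # dedupe, keep order
    steps.append(f"QUERY {gold_sql}")
    steps.append(f"ANSWER {gold_answer}")
    return steps
```

Warmup data in this shape teaches the model the action format and the explore-first habit before RL takes over.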
|
|
Data in `data/` is adapted from [Spider](https://yale-lily.github.io/spider) (Yu et al., 2018) and shared under CC BY-SA 4.0. See [DATA_LICENSE](DATA_LICENSE).
|
|
## Project Structure

```
sqlenv/
├── __init__.py, client.py, models.py   # Core types and client
├── server/
│   ├── app.py                  # FastAPI server
│   ├── sql_environment.py      # Environment implementation
│   ├── reward.py               # Three-layer reward function
│   ├── verifier.py             # Answer verification
│   └── Dockerfile              # HF Spaces deployment
├── evaluation/                 # Green Agent evaluator, policies
├── training/                   # TRL adapter, data loading
├── scripts/                    # Data curation, SFT generation
├── notebooks/                  # Training, evaluation, showcase
├── data/
│   ├── databases/              # 10 Spider SQLite databases
│   ├── questions/              # Train/eval question sets
│   └── sft/                    # SFT warmup trajectories
├── configs/                    # Training configurations
├── tests/                      # Unit and integration tests
└── docs/
    ├── data-sources.md         # Data provenance
    └── ARCHITECTURE.md         # System architecture
```
|
|
## References

- Yu et al. (2018). [Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task](https://yale-lily.github.io/spider). EMNLP.
- Shao et al. (2024). [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). (GRPO algorithm)
- Ng et al. (1999). [Policy Invariance Under Reward Transformations](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf). ICML.
- [OpenEnv framework](https://github.com/meta-pytorch/OpenEnv)
- [TRL OpenEnv docs](https://huggingface.co/docs/trl/openenv)

## License

Code: [MIT](LICENSE). Data: [CC BY-SA 4.0](DATA_LICENSE).
|
|