# System Behavior: Dataset Curation
> Living document. Updated by `/archive-spec` when features are completed.
> Last archived: F004 on 2026-03-24
---
## ADDED
### Curation script produces enriched question dataset
<!-- since: F004 -->
Running `python scripts/curate_questions.py` produces two JSON files, `data/questions/questions_train.json` and `data/questions/questions_eval.json`, which together contain 100+ enriched questions across 10 Spider databases. Each question record includes the fields `question_id`, `question_text`, `database_name`, `gold_sql`, `gold_answer`, `answer_type`, `difficulty`, `tables_involved`, and `split`.
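To make the record shape concrete, here is a minimal sketch of one question record and a field-completeness check. The field names come from this spec. Every value in `example_record`, the `missing_fields` helper, and the `REQUIRED_FIELDS` constant are hypothetical and only illustrate the structure; the actual ID format, answer types, and difficulty labels may differ.

```python
# Field names listed in this spec for every question record.
REQUIRED_FIELDS = (
    "question_id",
    "question_text",
    "database_name",
    "gold_sql",
    "gold_answer",
    "answer_type",
    "difficulty",
    "tables_involved",
    "split",
)


def missing_fields(record: dict) -> list:
    """Return the required field names that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Hypothetical record: all values below are illustrative placeholders,
# not taken from the real dataset.
example_record = {
    "question_id": "concert_singer_train_000",
    "question_text": "How many singers do we have?",
    "database_name": "concert_singer",
    "gold_sql": "SELECT count(*) FROM singer",
    "gold_answer": 6,
    "answer_type": "integer",
    "difficulty": "easy",
    "tables_involved": ["singer"],
    "split": "train",
}

print(missing_fields(example_record))
```

An empty list means the record carries every field named above.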
### Curation script downloads Spider SQLite databases on demand
<!-- since: F004 -->
Running `python scripts/curate_questions.py` downloads Spider SQLite database files into `data/databases/{db_id}/{db_id}.sqlite` for each configured database. Existing files are skipped.
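The path layout and skip-if-present behavior can be sketched as follows. This is an illustration under assumptions, not the script's actual code: `sqlite_path` and `needs_download` are hypothetical helper names, and `concert_singer` is used only as an example `db_id`.

```python
from pathlib import Path


def sqlite_path(db_id: str, root: Path = Path("data/databases")) -> Path:
    """Build the expected location data/databases/{db_id}/{db_id}.sqlite."""
    return root / db_id / f"{db_id}.sqlite"


def needs_download(db_id: str, root: Path = Path("data/databases")) -> bool:
    """A database is fetched only when its SQLite file is not already on disk."""
    return not sqlite_path(db_id, root).exists()


# Example db_id chosen for illustration only.
print(sqlite_path("concert_singer").as_posix())
```

Checking for the file before fetching is what makes repeated runs cheap: a second run touches the network only for databases that are still missing.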
### Curation script accepts validate-only mode
<!-- since: F004 -->
Running `python scripts/curate_questions.py --validate` validates the existing dataset files without downloading databases or regenerating questions. It checks field completeness, gold SQL execution, answer correctness, split integrity, and difficulty distribution. The script exits with code 0 if the dataset is valid and 1 if it is invalid.
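Because the result is reported through the exit code, a caller such as a CI step or a test can gate on it. The sketch below assumes the script is invoked from the repository root; `dataset_is_valid` is a hypothetical wrapper, not part of the project.

```python
import subprocess
import sys


def dataset_is_valid(script: str = "scripts/curate_questions.py") -> bool:
    """Run the script in validate-only mode and map its exit code to a bool.

    Exit code 0 means valid; any non-zero code is treated as invalid here.
    """
    result = subprocess.run([sys.executable, script, "--validate"], check=False)
    return result.returncode == 0
```

Passing `check=False` keeps a failed validation from raising `CalledProcessError`, so the caller decides how to react to an invalid dataset.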
### Dataset provides train/eval split
<!-- since: F004 -->
The dataset is split into `questions_train.json` (approximately 70%) and `questions_eval.json` (approximately 30%) with no overlapping question IDs between the two files.
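The no-overlap guarantee and the approximate 70/30 ratio can be checked with a few lines. This is a sketch: `overlapping_ids` and `train_fraction` are hypothetical helpers that operate on already-loaded lists of question records, and the sample IDs are made up.

```python
def overlapping_ids(train: list, eval_: list) -> set:
    """Return question IDs that appear in both splits (should be empty)."""
    train_ids = {q["question_id"] for q in train}
    eval_ids = {q["question_id"] for q in eval_}
    return train_ids & eval_ids


def train_fraction(train: list, eval_: list) -> float:
    """Share of all questions that landed in the train split."""
    total = len(train) + len(eval_)
    return len(train) / total if total else 0.0


# Made-up records carrying only the field these checks need.
train = [{"question_id": f"q{i}"} for i in range(7)]
eval_ = [{"question_id": f"q{i}"} for i in range(7, 10)]

print(overlapping_ids(train, eval_), train_fraction(train, eval_))
```

In practice the two lists would come from `json.load` on `questions_train.json` and `questions_eval.json`.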
### Dataset covers multiple domains and difficulty levels
<!-- since: F004 -->
Questions span 10 Spider databases from diverse domains (education, entertainment, geography, automotive, HR, etc.). The difficulty distribution targets approximately 40% easy, 40% medium, and 20% hard, where difficulty is based on the number of tables involved in each query.
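One way to picture table-count-based difficulty is sketched below. The cut-offs are an assumption made for illustration (one table is easy, two is medium, three or more is hard); this spec does not state the thresholds the script uses. `classify_difficulty` and `difficulty_shares` are hypothetical helpers.

```python
from collections import Counter


def classify_difficulty(tables_involved: list) -> str:
    """Map a query's distinct table count to a difficulty label.

    ASSUMED thresholds: <=1 table easy, 2 tables medium, 3+ tables hard.
    """
    table_count = len(set(tables_involved))
    if table_count <= 1:
        return "easy"
    if table_count == 2:
        return "medium"
    return "hard"


def difficulty_shares(questions: list) -> dict:
    """Fraction of questions at each difficulty level."""
    counts = Counter(q["difficulty"] for q in questions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: counts[level] / total for level in ("easy", "medium", "hard")}
```

Comparing the output of `difficulty_shares` against the 40/40/20 target, within some tolerance, is one way a distribution check could be expressed.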