Spaces:

hjerpe
/

sql_env

Sleeping

App Files Files Community

sql_env / specs /behavior /dataset-curation.md

hjerpe

Upload folder using huggingface_hub

5dd1bb4 verified about 2 months ago

preview code

raw

history blame contribute delete

1.83 kB

	# System Behavior: Dataset Curation

	> Living document. Updated by `/archive-spec` when features are completed.
	> Last archived: F004 on 2026-03-24

	---

	## ADDED

	### Curation script produces enriched question dataset
	<!-- since: F004 -->

	Running `python scripts/curate_questions.py` produces two JSON files (`data/questions/questions_train.json` and `data/questions/questions_eval.json`) containing 100+ enriched questions across 10 Spider databases. Each question record includes `question_id`, `question_text`, `database_name`, `gold_sql`, `gold_answer`, `answer_type`, `difficulty`, `tables_involved`, and `split` fields.

	### Curation script downloads Spider SQLite databases on demand
	<!-- since: F004 -->

	Running `python scripts/curate_questions.py` downloads Spider SQLite database files into `data/databases/{db_id}/{db_id}.sqlite` for each configured database. Existing files are skipped.

	### Curation script accepts validate-only mode
	<!-- since: F004 -->

	Running `python scripts/curate_questions.py --validate` validates the existing dataset files without downloading or re-generating. It checks field completeness, gold SQL execution, answer correctness, split integrity, and difficulty distribution. Returns exit code 0 if valid, 1 if invalid.

	### Dataset provides train/eval split
	<!-- since: F004 -->

	The dataset is split into `questions_train.json` (approximately 70%) and `questions_eval.json` (approximately 30%) with no overlapping question IDs between the two files.

	### Dataset covers multiple domains and difficulty levels
	<!-- since: F004 -->

	Questions span 10 Spider databases from diverse domains (education, entertainment, geography, automotive, HR, etc.) with difficulty distribution targeting approximately 40% easy, 40% medium, 20% hard based on the number of tables involved in each query.