Add comprehensive model card for CSC-SQL (#1)

302e09c 8 months ago

7.2 kB

	---
	license: cc-by-nc-4.0
	pipeline_tag: text-generation
	library_name: transformers
	tags:
	- text-to-sql
	- sql-generation
	- reinforcement-learning
	- qwen
	---

	# CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning

	The model presented in the paper [CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning](https://huggingface.co/papers/2505.13271).

	Abstract: Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD private test set, our 7B model achieves 71.72% execution accuracy, while the 32B model achieves 73.67%. The code has been open sourced at this https URL.

	Code: The code for CSC-SQL is open-sourced at [https://github.com/CycloneBoy/csc_sql](https://github.com/CycloneBoy/csc_sql).

	## Introduction

	CSC-SQL is a novel method that integrates Self-Consistency and Self-Correction for improved Text-to-SQL generation. It addresses limitations of prior methods by selecting optimal outputs and handling both syntactic and semantic errors. The approach employs Group Relative Policy Optimization (GRPO) to fine-tune SQL generation and revision models, leading to significant enhancements in output quality.

	![csc_sql_framework](https://github.com/CycloneBoy/csc_sql/raw/main/data/image/csc_sql_framework.png)

	## Main Results

	Performance Comparison of different Text-to-SQL methods on BIRD dev and test dataset.

	![csc_sql_result_main](https://github.com/CycloneBoy/csc_sql/raw/main/data/image/csc_sql_result_main.png)

	## Models

	A collection of CSC-SQL models can be found on Hugging Face: [CSC-SQL Hugging Face Collection](https://huggingface.co/collections/cycloneboy/csc-sql-6835c4a52da10c54bbe14f8e).

	\| Model and Dataset \| HuggingFace \|
	\|---------------------------------------\|--------------------------------------------------------------------------------------------\|
	\| CscSQL-Merge-Qwen2.5-Coder-3B-Instruct \| [🤗 HuggingFace](https://huggingface.co/cycloneboy/CscSQL-Merge-Qwen2.5-Coder-3B-Instruct) \|
	\| CscSQL-Merge-Qwen2.5-Coder-7B-Instruct \| [🤗 HuggingFace](https://huggingface.co/cycloneboy/CscSQL-Merge-Qwen2.5-Coder-7B-Instruct) \|
	\| CscSQL-Grpo-Qwen2.5-Coder-3B-Instruct \| [🤗 HuggingFace](https://huggingface.co/cycloneboy/CscSQL-Grpo-Qwen2.5-Coder-3B-Instruct) \|
	\| CscSQL-Grpo-XiYanSQL-QwenCoder-3B-2502 \| [🤗 HuggingFace](https://huggingface.co/cycloneboy/CscSQL-Grpo-XiYanSQL-QwenCoder-3B-2502) \|
	\| CscSQL-Grpo-Qwen2.5-Coder-7B-Instruct \| [🤗 HuggingFace](https://huggingface.co/cycloneboy/CscSQL-Grpo-Qwen2.5-Coder-7B-Instruct) \|
	\| CscSQL-Grpo-XiYanSQL-QwenCoder-7B-2502 \| [🤗 HuggingFace](https://huggingface.co/cycloneboy/CscSQL-Grpo-XiYanSQL-QwenCoder-7B-2502) \|

	## Dataset

	The BIRD training and development datasets used can be found here: [BIRD Train Dataset](https://huggingface.co/datasets/cycloneboy/bird_train).

	## Quickstart

	This section provides instructions on how to use the pre-trained CSC-SQL models.

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

	model_dir = "cycloneboy/CscSQL-Grpo-Qwen2.5-Coder-7B-Instruct" # Or other released models

	def load_model_tokenizer(model_path):
	tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
	tokenizer.eos_token = "<\|im_end\|>"
	tokenizer.pad_token = "<\|endoftext\|>"
	tokenizer.eos_token_id = tokenizer.convert_tokens_to_ids(tokenizer.eos_token)
	tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
	tokenizer.padding_side = "left"

	model = AutoModelForCausalLM.from_pretrained(model_path, device_map='auto',torch_dtype=torch.bfloat16, trust_remote_code=True)
	return model, tokenizer

	# Example usage for a natural language question (Text-to-SQL)
	# Make sure your input string ends with "<\|im_start\|>assistant
	" for generation
	text_list = ["""
	<\|im_start\|>user
	Your task is to write a SQLite query given a natural language question and a database schema.
	You need to generate the SQL query that answers the question correctly.

	For example, to find out the names of all the songs, given:
	CREATE TABLE songs (
	song_id INTEGER PRIMARY KEY,
	song_name TEXT
	);
	Question: What are the names of all the songs?
	SQL: SELECT song_name FROM songs

	To find the artist of the song 'Yesterday', given:
	CREATE TABLE songs (
	song_id INTEGER PRIMARY KEY,
	song_name TEXT,
	artist_id INTEGER
	);
	CREATE TABLE artists (
	artist_id INTEGER PRIMARY KEY,
	artist_name TEXT
	);
	Question: Who is the artist of the song 'Yesterday'?
	SQL: SELECT T2.artist_name FROM songs AS T1 JOIN artists AS T2 ON T1.artist_id = T2.artist_id WHERE T1.song_name = 'Yesterday'

	Now, answer the following question.
	Question: How many records are there in the table 'songs'?
	SQL:
	<\|im_end\|>
	<\|im_start\|>assistant
	"""]

	model, tokenizer = load_model_tokenizer(model_dir)
	inputs = tokenizer(text_list, return_tensors='pt', padding=True, add_special_tokens=False).to('cuda')
	input_ids = inputs["input_ids"]
	attention_mask = inputs["attention_mask"]
	generation_config = GenerationConfig(
	eos_token_id=tokenizer.eos_token_id,
	pad_token_id=tokenizer.pad_token_id,
	temperature=0.1,
	max_new_tokens=512,
	num_return_sequences=1,
	num_beams=1,
	top_p=0.95,
	do_sample=False
	)
	outputs = model.generate(
	inputs= input_ids,
	attention_mask=attention_mask,
	**generation_config.to_dict()
	)
	gen_text = tokenizer.batch_decode(outputs[:, input_ids.shape[1]:], skip_special_tokens=True)
	print(gen_text[0])

	# Expected output: SELECT count(*) FROM songs
	```

	## Citation

	If you find our work useful or helpful for your R&D works, please feel free to cite our paper as below.

	```bibtex
	@misc{sheng2025cscsqlcorrectiveselfconsistencytexttosql,
	title={CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning},
	author={Lei Sheng and Shuai-Shuai Xu},
	year={2025},
	eprint={2505.13271},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2505.13271},
	}
	```