junmingg's picture
Upload README.md with huggingface_hub
7667c11 verified
|
Raw
History Blame Contribute Delete
6.04 kB
---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
datasets:
- b-mc2/sql-create-context
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- text-to-sql
- text2sql
- sql
- qlora
- peft
- unsloth
- qwen2.5
---
# Qwen2.5-Coder-7B Text-to-SQL (QLoRA)
A QLoRA fine-tune of [`Qwen/Qwen2.5-Coder-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct)
that turns a **SQL schema + a natural-language question** into **a single SQL query**. Trained on a single
RTX 3090 with [Unsloth](https://github.com/unslothai/unsloth) + TRL, completion-only loss.
> Trained and evaluated by `junmingg`. Code + reproducible eval harness:
> https://github.com/junmingg/Unsloth-Qwen2.5-Coder-7b-Text-to-SQL-SFT
## Results (held-out 500 examples)
| Model | Exact match | Semantic equiv. | SQL validity |
|---|---|---|---|
| Base `Qwen2.5-Coder-7B-Instruct` (zero-shot) | 3.8% | 67.0% | 100.0% |
| **+ QLoRA (this model)** | **78.8%** | **86.2%** | **99.2%** |
![benchmark](benchmark.png)
- **Exact match** = canonicalized string match (via `sqlglot`); a strict *lower bound* β€” different-but-correct
SQL fails it.
- **Semantic equivalence** = an independent LLM judge (GLM-5-Turbo) decides whether the predicted SQL is
equivalent to the gold for the question. This is the headline correctness metric.
- **SQL validity** = fraction of predictions that parse under `sqlglot`.
Fine-tuning lifted exact-match by **+75.0 points** and semantic-equivalence by **+19.2 points** on prompts the
model never saw during training (0/500 test prompts appear in the training data). The large exact-match jump
is the model learning the dataset's SQL conventions (quoting, value formatting); the semantic jump is genuine
correctness beyond formatting.
### Label-noise ablation
`sql-create-context` carries a little label noise β€” 0.32% of train golds (and 0.60% of test golds) won't
parse under `sqlglot` (WikiSQL artifacts). A controlled variant
([`junmingg/qwen2.5-coder-7b-text2sql-filtered`](https://huggingface.co/junmingg/qwen2.5-coder-7b-text2sql-filtered))
drops those 80 noisy rows from the training set β€” **same recipe, same held-out test set**:
| Model | Exact match | Semantic equiv. | SQL validity |
|---|---|---|---|
| Base (zero-shot) | 3.8% | 67.0% | 100.0% |
| SFT β€” full 25k | 78.8% | 86.2% | 99.2% |
| SFT β€” filtered 24.9k | 78.6% | 85.4% | **99.8%** |
![ablation](benchmark_3way.png)
Cleaning 0.32% of train labels nudged **validity up, accuracy unchanged** (within run-to-run noise). On the
valid-reference subset the filtered model reaches 100% validity. Full write-up in the
[repo README](https://github.com/junmingg/Unsloth-Qwen2.5-Coder-7b-Text-to-SQL-SFT).
## Usage
The model was trained with a specific system prompt and ChatML format β€” **use the same formatting** for best
results.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
REPO = "junmingg/qwen2.5-coder-7b-text2sql"
model = AutoModelForCausalLM.from_pretrained(REPO, torch_dtype=torch.bfloat16, device_map="auto")
tok = AutoTokenizer.from_pretrained(REPO)
SYSTEM = ("You are a precise text-to-SQL engine. Given a SQL schema and a natural-language "
"question, respond with a single valid SQL query and nothing else.")
schema = "CREATE TABLE head (age INTEGER, name TEXT)"
question = "How many heads of the departments are older than 56?"
messages = [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": f"Schema:\n{schema}\n\nQuestion: {question}"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
# -> SELECT COUNT(*) FROM head WHERE age > 56
```
LoRA adapters only (smaller): `junmingg/qwen2.5-coder-7b-text2sql-lora`.
GGUF quants (`q4_k_m`, `q8_0`) for llama.cpp / Ollama: `junmingg/qwen2.5-coder-7b-text2sql-GGUF`.
## Training
| | |
|---|---|
| Base | `Qwen/Qwen2.5-Coder-7B-Instruct` (4-bit QLoRA) |
| Data | [`b-mc2/sql-create-context`](https://huggingface.co/datasets/b-mc2/sql-create-context) (~78.6k rows total). Shuffled (seed 42), then a **25k train subset** + a fixed **500 held-out test** set. The remaining ~53k rows were intentionally unused β€” 1 epoch over 25k already saturates this task; adding the rest gives negligible lift while ~doubling/tripling train time. |
| LoRA | r=16, Ξ±=16, dropout=0, all linear layers |
| Schedule | 1 epoch (~1,557 steps), effective batch 16, lr 2e-4 cosine, warmup 0.03 |
| Precision / optim | bf16, `adamw_8bit`, `max_seq_length=2048` |
| Loss | completion-only (prompt masked; loss on the SQL answer only) |
| Hardware / time | 1Γ— RTX 3090, ~59 min |
## Evaluation methodology
Both base and fine-tuned models are evaluated with **identical** prompting and greedy decoding on the **same**
held-out 500-example test set (disjoint from train; verified 0 prompt overlap). Semantic equivalence is scored
by an independent LLM judge (GLM-5-Turbo). Reporting both exact-match and semantic-equivalence is deliberate:
exact-match is a lower bound, semantic-equivalence is the fairer correctness signal. The full harness is in the
linked repo (`python -m src.eval`, `python -m src.judge`).
## Limitations
- Trained on synthetic, largely single-table schemas (`sql-create-context`, derived from WikiSQL + Spider);
not evaluated for SQL-injection safety or complex multi-join queries.
- The dataset contains a small amount of label noise (~0.3–0.6% of gold answers are themselves unparseable β€”
WikiSQL artifacts); the model's rare "invalid" outputs are predominantly faithful reproductions of those
malformed references rather than novel errors.
- Outputs should be validated/parameterized before execution against a real database.
## License & attribution
Apache-2.0 (inherited from the base model). Training data: `b-mc2/sql-create-context` (CC-BY-4.0).