---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
datasets:
  - b-mc2/sql-create-context
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - text-to-sql
  - text2sql
  - sql
  - qlora
  - peft
  - unsloth
  - qwen2.5
---

# Qwen2.5-Coder-7B Text-to-SQL (QLoRA)

A QLoRA fine-tune of [`Qwen/Qwen2.5-Coder-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct)
that turns a **SQL schema + a natural-language question** into **a single SQL query**. Trained on a single
RTX 3090 with [Unsloth](https://github.com/unslothai/unsloth) + TRL, completion-only loss.

> Trained and evaluated by `junmingg`. Code + reproducible eval harness:
> https://github.com/junmingg/Unsloth-Qwen2.5-Coder-7b-Text-to-SQL-SFT

## Results (held-out 500 examples)

| Model | Exact match | Semantic equiv. | SQL validity |
|---|---|---|---|
| Base `Qwen2.5-Coder-7B-Instruct` (zero-shot) | 3.8% | 67.0% | 100.0% |
| **+ QLoRA (this model)** | **78.8%** | **86.2%** | **99.2%** |

![benchmark](benchmark.png)

- **Exact match** = canonicalized string match (via `sqlglot`); a strict *lower bound* — different-but-correct
  SQL fails it.
- **Semantic equivalence** = an independent LLM judge (GLM-5-Turbo) decides whether the predicted SQL is
  equivalent to the gold for the question. This is the headline correctness metric.
- **SQL validity** = fraction of predictions that parse under `sqlglot`.

Fine-tuning lifted exact-match by **+75.0 points** and semantic-equivalence by **+19.2 points** on prompts the
model never saw during training (0/500 test prompts appear in the training data). The large exact-match jump
is the model learning the dataset's SQL conventions (quoting, value formatting); the semantic jump is genuine
correctness beyond formatting.

### Label-noise ablation

`sql-create-context` carries a little label noise — 0.32% of train golds (and 0.60% of test golds) won't
parse under `sqlglot` (WikiSQL artifacts). A controlled variant
([`junmingg/qwen2.5-coder-7b-text2sql-filtered`](https://huggingface.co/junmingg/qwen2.5-coder-7b-text2sql-filtered))
drops those 80 noisy rows from the training set — **same recipe, same held-out test set**:

| Model | Exact match | Semantic equiv. | SQL validity |
|---|---|---|---|
| Base (zero-shot) | 3.8% | 67.0% | 100.0% |
| SFT — full 25k | 78.8% | 86.2% | 99.2% |
| SFT — filtered 24.9k | 78.6% | 85.4% | **99.8%** |

![ablation](benchmark_3way.png)

Cleaning 0.32% of train labels nudged **validity up, accuracy unchanged** (within run-to-run noise). On the
valid-reference subset the filtered model reaches 100% validity. Full write-up in the
[repo README](https://github.com/junmingg/Unsloth-Qwen2.5-Coder-7b-Text-to-SQL-SFT).

## Usage

The model was trained with a specific system prompt and ChatML format — **use the same formatting** for best
results.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

REPO = "junmingg/qwen2.5-coder-7b-text2sql"
model = AutoModelForCausalLM.from_pretrained(REPO, torch_dtype=torch.bfloat16, device_map="auto")
tok = AutoTokenizer.from_pretrained(REPO)

SYSTEM = ("You are a precise text-to-SQL engine. Given a SQL schema and a natural-language "
          "question, respond with a single valid SQL query and nothing else.")

schema = "CREATE TABLE head (age INTEGER, name TEXT)"
question = "How many heads of the departments are older than 56?"

messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": f"Schema:\n{schema}\n\nQuestion: {question}"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
# -> SELECT COUNT(*) FROM head WHERE age > 56
```

LoRA adapters only (smaller): `junmingg/qwen2.5-coder-7b-text2sql-lora`.
GGUF quants (`q4_k_m`, `q8_0`) for llama.cpp / Ollama: `junmingg/qwen2.5-coder-7b-text2sql-GGUF`.

## Training

| | |
|---|---|
| Base | `Qwen/Qwen2.5-Coder-7B-Instruct` (4-bit QLoRA) |
| Data | [`b-mc2/sql-create-context`](https://huggingface.co/datasets/b-mc2/sql-create-context) (~78.6k rows total). Shuffled (seed 42), then a **25k train subset** + a fixed **500 held-out test** set. The remaining ~53k rows were intentionally unused — 1 epoch over 25k already saturates this task; adding the rest gives negligible lift while ~doubling/tripling train time. |
| LoRA | r=16, α=16, dropout=0, all linear layers |
| Schedule | 1 epoch (~1,557 steps), effective batch 16, lr 2e-4 cosine, warmup 0.03 |
| Precision / optim | bf16, `adamw_8bit`, `max_seq_length=2048` |
| Loss | completion-only (prompt masked; loss on the SQL answer only) |
| Hardware / time | 1× RTX 3090, ~59 min |

## Evaluation methodology

Both base and fine-tuned models are evaluated with **identical** prompting and greedy decoding on the **same**
held-out 500-example test set (disjoint from train; verified 0 prompt overlap). Semantic equivalence is scored
by an independent LLM judge (GLM-5-Turbo). Reporting both exact-match and semantic-equivalence is deliberate:
exact-match is a lower bound, semantic-equivalence is the fairer correctness signal. The full harness is in the
linked repo (`python -m src.eval`, `python -m src.judge`).

## Limitations

- Trained on synthetic, largely single-table schemas (`sql-create-context`, derived from WikiSQL + Spider);
  not evaluated for SQL-injection safety or complex multi-join queries.
- The dataset contains a small amount of label noise (~0.3–0.6% of gold answers are themselves unparseable —
  WikiSQL artifacts); the model's rare "invalid" outputs are predominantly faithful reproductions of those
  malformed references rather than novel errors.
- Outputs should be validated/parameterized before execution against a real database.

## License & attribution

Apache-2.0 (inherited from the base model). Training data: `b-mc2/sql-create-context` (CC-BY-4.0).