--- license: apache-2.0 base_model: Qwen/Qwen2.5-Coder-7B-Instruct datasets: - b-mc2/sql-create-context language: - en library_name: transformers pipeline_tag: text-generation tags: - text-to-sql - text2sql - sql - qlora - peft - unsloth - qwen2.5 --- # Qwen2.5-Coder-7B Text-to-SQL (QLoRA) A QLoRA fine-tune of [`Qwen/Qwen2.5-Coder-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) that turns a **SQL schema + a natural-language question** into **a single SQL query**. Trained on a single RTX 3090 with [Unsloth](https://github.com/unslothai/unsloth) + TRL, completion-only loss. > Trained and evaluated by `junmingg`. Code + reproducible eval harness: > https://github.com/junmingg/Unsloth-Qwen2.5-Coder-7b-Text-to-SQL-SFT ## Results (held-out 500 examples) | Model | Exact match | Semantic equiv. | SQL validity | |---|---|---|---| | Base `Qwen2.5-Coder-7B-Instruct` (zero-shot) | 3.8% | 67.0% | 100.0% | | **+ QLoRA (this model)** | **78.8%** | **86.2%** | **99.2%** | ![benchmark](benchmark.png) - **Exact match** = canonicalized string match (via `sqlglot`); a strict *lower bound* — different-but-correct SQL fails it. - **Semantic equivalence** = an independent LLM judge (GLM-5-Turbo) decides whether the predicted SQL is equivalent to the gold for the question. This is the headline correctness metric. - **SQL validity** = fraction of predictions that parse under `sqlglot`. Fine-tuning lifted exact-match by **+75.0 points** and semantic-equivalence by **+19.2 points** on prompts the model never saw during training (0/500 test prompts appear in the training data). The large exact-match jump is the model learning the dataset's SQL conventions (quoting, value formatting); the semantic jump is genuine correctness beyond formatting. ### Label-noise ablation `sql-create-context` carries a little label noise — 0.32% of train golds (and 0.60% of test golds) won't parse under `sqlglot` (WikiSQL artifacts). A controlled variant ([`junmingg/qwen2.5-coder-7b-text2sql-filtered`](https://huggingface.co/junmingg/qwen2.5-coder-7b-text2sql-filtered)) drops those 80 noisy rows from the training set — **same recipe, same held-out test set**: | Model | Exact match | Semantic equiv. | SQL validity | |---|---|---|---| | Base (zero-shot) | 3.8% | 67.0% | 100.0% | | SFT — full 25k | 78.8% | 86.2% | 99.2% | | SFT — filtered 24.9k | 78.6% | 85.4% | **99.8%** | ![ablation](benchmark_3way.png) Cleaning 0.32% of train labels nudged **validity up, accuracy unchanged** (within run-to-run noise). On the valid-reference subset the filtered model reaches 100% validity. Full write-up in the [repo README](https://github.com/junmingg/Unsloth-Qwen2.5-Coder-7b-Text-to-SQL-SFT). ## Usage The model was trained with a specific system prompt and ChatML format — **use the same formatting** for best results. ```python from transformers import AutoModelForCausalLM, AutoTokenizer import torch REPO = "junmingg/qwen2.5-coder-7b-text2sql" model = AutoModelForCausalLM.from_pretrained(REPO, torch_dtype=torch.bfloat16, device_map="auto") tok = AutoTokenizer.from_pretrained(REPO) SYSTEM = ("You are a precise text-to-SQL engine. Given a SQL schema and a natural-language " "question, respond with a single valid SQL query and nothing else.") schema = "CREATE TABLE head (age INTEGER, name TEXT)" question = "How many heads of the departments are older than 56?" messages = [ {"role": "system", "content": SYSTEM}, {"role": "user", "content": f"Schema:\n{schema}\n\nQuestion: {question}"}, ] inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device) out = model.generate(inputs, max_new_tokens=256, do_sample=False) print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)) # -> SELECT COUNT(*) FROM head WHERE age > 56 ``` LoRA adapters only (smaller): `junmingg/qwen2.5-coder-7b-text2sql-lora`. GGUF quants (`q4_k_m`, `q8_0`) for llama.cpp / Ollama: `junmingg/qwen2.5-coder-7b-text2sql-GGUF`. ## Training | | | |---|---| | Base | `Qwen/Qwen2.5-Coder-7B-Instruct` (4-bit QLoRA) | | Data | [`b-mc2/sql-create-context`](https://huggingface.co/datasets/b-mc2/sql-create-context) (~78.6k rows total). Shuffled (seed 42), then a **25k train subset** + a fixed **500 held-out test** set. The remaining ~53k rows were intentionally unused — 1 epoch over 25k already saturates this task; adding the rest gives negligible lift while ~doubling/tripling train time. | | LoRA | r=16, α=16, dropout=0, all linear layers | | Schedule | 1 epoch (~1,557 steps), effective batch 16, lr 2e-4 cosine, warmup 0.03 | | Precision / optim | bf16, `adamw_8bit`, `max_seq_length=2048` | | Loss | completion-only (prompt masked; loss on the SQL answer only) | | Hardware / time | 1× RTX 3090, ~59 min | ## Evaluation methodology Both base and fine-tuned models are evaluated with **identical** prompting and greedy decoding on the **same** held-out 500-example test set (disjoint from train; verified 0 prompt overlap). Semantic equivalence is scored by an independent LLM judge (GLM-5-Turbo). Reporting both exact-match and semantic-equivalence is deliberate: exact-match is a lower bound, semantic-equivalence is the fairer correctness signal. The full harness is in the linked repo (`python -m src.eval`, `python -m src.judge`). ## Limitations - Trained on synthetic, largely single-table schemas (`sql-create-context`, derived from WikiSQL + Spider); not evaluated for SQL-injection safety or complex multi-join queries. - The dataset contains a small amount of label noise (~0.3–0.6% of gold answers are themselves unparseable — WikiSQL artifacts); the model's rare "invalid" outputs are predominantly faithful reproductions of those malformed references rather than novel errors. - Outputs should be validated/parameterized before execution against a real database. ## License & attribution Apache-2.0 (inherited from the base model). Training data: `b-mc2/sql-create-context` (CC-BY-4.0).