---
license: apache-2.0
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
  - sql
  - text-to-sql
  - qlora
  - unsloth
  - qwen2.5
  - database
  - natural-language-to-sql
datasets:
  - gretelai/synthetic_text_to_sql
language:
  - en
pipeline_tag: text-generation
library_name: transformers
model-index:
  - name: SQLForge-7B
    results: []
---

# SQLForge-7B

A fine-tuned **Qwen2.5-7B-Instruct** model specialized for **natural language to SQL generation**. Given a database schema and a question in plain English, it generates a SQL query and explains what the query does.

## Key Details

| | |
|---|---|
| **Base model** | Qwen/Qwen2.5-7B-Instruct |
| **Method** | QLoRA (4-bit NF4, rank 16, alpha 16) |
| **Library** | Unsloth + TRL SFTTrainer |
| **Dataset** | gretelai/synthetic_text_to_sql (10K examples from 100K) |
| **Hardware** | NVIDIA RTX A5000 (24GB VRAM) on RunPod |
| **Training time** | ~2.75 hours (500 steps) |
| **Final loss** | 0.414 |
| **Parameters trained** | 40.4M of 7.66B (0.53%) |
| **Format** | ChatML |
| **Output** | Merged 16-bit safetensors |
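
The trained-parameter figure in the table can be sanity-checked with a short calculation. This is a rough sketch, not the training script: the layer dimensions are assumed from the base model's published configuration, and the set of adapted modules (all attention and MLP projections) is an assumption, since the card does not list the LoRA target modules.

```python
# Assumed Qwen2.5-7B dimensions (from the base model's config, not from this card)
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128  # 4 key/value heads x head dimension 128
RANK = 16         # LoRA rank stated in the table above

# Assumed target modules: (in_features, out_features) per decoder block
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Each adapter adds A (rank x in) and B (out x rank): rank * (in + out) weights."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in TARGETS.values())
    return per_layer * NUM_LAYERS


trainable = lora_param_count(RANK)
total = 7.66e9  # total parameter count as reported in the table above
print(f"{trainable:,} trainable parameters ({trainable / total:.2%} of total)")
```

Under these assumptions the count comes to 40,370,176, which is consistent with the reported 40.4M and 0.53%.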

## Dataset

Trained on 10,000 examples from the [gretelai/synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql) dataset, which covers 100 domains with a wide range of SQL complexity levels including subqueries, joins, aggregations, window functions, and set operations. Each example includes the database schema (CREATE TABLE statements), a natural language question, the correct SQL query, and an explanation.
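
As an illustration of how one dataset record could map onto a chat conversation, here is a minimal sketch. The column names (`sql_context`, `sql_prompt`, `sql`, `sql_explanation`) are those of the dataset; the system prompt is copied from the usage example in this card; the helper `to_chat_messages`, the layout of the assistant turn, and the sample record are hypothetical and may differ from what the training scripts actually do.

```python
SYSTEM_PROMPT = (
    "You are an expert SQL assistant. Given a database schema and a natural "
    "language question, write the correct SQL query and explain what it does."
)


def to_chat_messages(record: dict) -> list[dict]:
    """Turn one dataset record into system/user/assistant messages (assumed layout)."""
    user = f"Schema:\n{record['sql_context']}\n\nQuestion: {record['sql_prompt']}"
    assistant = f"{record['sql']}\n\n{record['sql_explanation']}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
        {"role": "assistant", "content": assistant},
    ]


# Hypothetical record, written by hand for illustration (not taken from the dataset)
sample = {
    "sql_context": "CREATE TABLE orders (id INT, customer VARCHAR(50), total DECIMAL(10,2));",
    "sql_prompt": "What is the total revenue per customer?",
    "sql": "SELECT customer, SUM(total) AS revenue FROM orders GROUP BY customer;",
    "sql_explanation": "Groups orders by customer and sums the order totals.",
}

messages = to_chat_messages(sample)
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

A list like this can be passed to `tokenizer.apply_chat_template` to render the ChatML text for a single example.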

## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("sriksven/SQLForge-7B")
tokenizer = AutoTokenizer.from_pretrained("sriksven/SQLForge-7B")

messages = [
    {
        "role": "system",
        "content": "You are an expert SQL assistant. Given a database schema and a natural language question, write the correct SQL query and explain what it does.",
    },
    {
        "role": "user",
        "content": (
            "Schema:\n"
            "CREATE TABLE employees (id INT, name VARCHAR(100), department VARCHAR(50), salary DECIMAL(10,2));\n"
            "CREATE TABLE departments (name VARCHAR(50), budget DECIMAL(12,2));\n\n"
            "Question: What is the average salary by department, only showing departments with average salary above 75000?"
        ),
    },
]

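# return_dict=True yields input_ids and attention_mask, which generate() takes as keyword arguments
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))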
```

### Unsloth (faster inference)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="sriksven/SQLForge-7B",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
```

## SQL Complexity Coverage

The training data includes queries across multiple complexity levels:
- Simple SELECT with WHERE clauses
- Aggregations with GROUP BY and HAVING
- Single and multiple JOINs
- Subqueries and correlated subqueries
- Window functions (ROW_NUMBER, RANK, LAG, LEAD)
- Set operations (UNION, INTERSECT, EXCEPT)
- Data definition and manipulation (CREATE, ALTER, INSERT)

## Intended Use

- Natural language interfaces to databases
- SQL copilot tools for analysts and developers
- Educational tools for learning SQL
- Prototyping data query systems

## Limitations

- Trained on synthetic data, not real production database queries
- May not handle highly domain-specific or proprietary SQL dialects
- Works best with standard SQL syntax (PostgreSQL/MySQL style)
- Does not validate against a live database, so SQL correctness is not guaranteed
- Long or deeply nested schemas may exceed the 2048-token context
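
Because the model does not check its output against a database, a caller may want a cheap sanity check before running a generated query. The sketch below is one possible approach, not part of this model: it loads the schema into an in-memory SQLite database and asks SQLite to plan the query, which catches syntax errors and references to missing tables or columns. The helper name `compiles_against_schema` is hypothetical, and since SQLite's dialect differs from PostgreSQL and MySQL, a passing check does not prove the query is correct on the target database.

```python
import sqlite3


def compiles_against_schema(schema_sql: str, query: str) -> bool:
    """Return True if SQLite can plan `query` against the tables in `schema_sql`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)               # create the empty tables
        conn.execute(f"EXPLAIN QUERY PLAN {query}")  # compiles the query without running it
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


schema = """
CREATE TABLE employees (id INT, name VARCHAR(100), department VARCHAR(50), salary DECIMAL(10,2));
CREATE TABLE departments (name VARCHAR(50), budget DECIMAL(12,2));
"""

good_query = (
    "SELECT department, AVG(salary) AS avg_salary FROM employees "
    "GROUP BY department HAVING AVG(salary) > 75000"
)
bad_query = "SELECT wage FROM employees"  # column does not exist in the schema

print(compiles_against_schema(schema, good_query))
print(compiles_against_schema(schema, bad_query))
```

The check only confirms that the query is well formed for the given schema; whether it answers the question still needs review or test data.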

## Training Infrastructure

| | |
|---|---|
| **GPU** | NVIDIA RTX A5000 24GB |
| **Cloud** | RunPod ($0.27/hr) |
| **Framework** | Unsloth 2026.5.2 + TRL + Transformers 5.5.0 |
| **Precision** | BF16 training, 4-bit NF4 base quantization |
| **Optimizer** | AdamW 8-bit |
| **Learning rate** | 2e-4, linear decay |
| **Batch size** | 16 effective (4 per device × 4 accumulation) |
| **Packing** | Enabled |
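
The batch-size and cost figures follow from simple arithmetic, sketched below using only numbers stated in this card (500 steps and ~2.75 hours from Key Details, the hourly rate from the table above). It assumes a single GPU, and the cost is an estimate that ignores setup and idle time.

```python
per_device_batch = 4
grad_accum_steps = 4
steps = 500
hours = 2.75
usd_per_hour = 0.27

# Effective batch size on one GPU: per-device batch x gradient accumulation steps
effective_batch = per_device_batch * grad_accum_steps

# Sequences consumed over the run; with packing enabled, one sequence
# can contain several training examples
sequences_seen = effective_batch * steps

seconds_per_step = hours * 3600 / steps
approx_cost = hours * usd_per_hour

print(f"effective batch size: {effective_batch}")
print(f"packed sequences seen: {sequences_seen}")
print(f"seconds per optimizer step: {seconds_per_step:.1f}")
print(f"approximate GPU cost: ${approx_cost:.2f}")
```

This works out to 8,000 packed sequences at roughly 20 seconds per optimizer step, for well under one US dollar of GPU time.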

## Source Code

Training scripts: [github.com/sriksven/LLM-FineTune-Suite](https://github.com/sriksven/LLM-FineTune-Suite)

## License

Apache 2.0