Update README.md
---
library_name: transformers
tags:
- text-to-sql
- sql
- code-generation
- qlora
- lora
---

# CodeLlama-7b Text-to-SQL (QLoRA Fine-tune)

A CodeLlama-7b model fine-tuned on the Spider dataset using QLoRA (4-bit quantization + LoRA adapters) for the task of converting natural language questions into SQL queries.

## Model Details

### Model Description

This model is a parameter-efficient fine-tune of CodeLlama-7b using QLoRA on the Spider Text-to-SQL benchmark dataset. It takes a database schema (as DDL statements) and a natural language question as input, and generates the corresponding SQL query.

- **Developed by:** [Your Name / Username]
- **Model type:** Causal Language Model (CodeLlama-7b + LoRA adapters)
- **Language(s) (NLP):** English
- **License:** Llama 2 Community License
- **Finetuned from model:** [codellama/CodeLlama-7b-hf](https://huggingface.co/codellama/CodeLlama-7b-hf)

### Model Sources

- **Repository:** [Your HuggingFace repo link]
- **Base Model:** https://huggingface.co/codellama/CodeLlama-7b-hf
- **Dataset:** https://huggingface.co/datasets/spider

## Uses

### Direct Use

This model is intended for converting natural language questions into SQL queries given a database schema. It is suitable for:

- Building natural language database interfaces
- SQL query auto-completion tools
- Educational tools for learning SQL

### Downstream Use

The LoRA adapter can be merged into the base model or used directly with PEFT for further fine-tuning on domain-specific SQL dialects or private schemas.
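
For standalone deployment, merging can be done with PEFT's `merge_and_unload`. A minimal sketch (the adapter repo id and output directory are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE_MODEL = "codellama/CodeLlama-7b-hf"
ADAPTER = "your-username/codellama-7b-text2sql"  # placeholder adapter repo

# Load the fp16 base model and attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)

# Fold the adapter weights into the base weights so the result can be used
# without PEFT, then save the merged model and tokenizer.
merged = model.merge_and_unload()
merged.save_pretrained("codellama-7b-text2sql-merged")
AutoTokenizer.from_pretrained(BASE_MODEL).save_pretrained("codellama-7b-text2sql-merged")
```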

### Out-of-Scope Use

- Not suitable for production databases without human review of generated queries
- Not tested on non-English questions
- Not designed for NoSQL or non-relational query languages

## Bias, Risks, and Limitations

- The model is trained only on Spider, which covers ~200 databases, so it may struggle with unseen schema patterns
- Generated SQL is not guaranteed to be syntactically or semantically correct
- The model may hallucinate column or table names not present in the provided schema
- All column types are inferred as TEXT/INTEGER/REAL, so nuanced type handling may be incorrect

### Recommendations

Always validate generated SQL against the actual database before execution. Do not run generated queries directly on production systems without review.
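
One lightweight way to do such a check, sketched below with Python's built-in `sqlite3` (the `validate_sql` helper is illustrative, not part of this repo), is to dry-run the generated statement against an empty in-memory database built from the same DDL; this catches syntax errors and references to nonexistent tables or columns:

```python
import sqlite3

def validate_sql(schema_ddl: str, query: str) -> bool:
    """Check that `query` parses and references only objects defined in `schema_ddl`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)                # build empty tables from the DDL
        conn.execute(f"EXPLAIN QUERY PLAN {query}")   # plan the query without real data
        return True
    except sqlite3.Error as err:
        print(f"Rejected generated SQL: {err}")
        return False
    finally:
        conn.close()

schema = "CREATE TABLE employees (id INTEGER, name TEXT, salary REAL, dept_id INTEGER);"
print(validate_sql(schema, "SELECT name FROM employees ORDER BY salary DESC"))  # True
print(validate_sql(schema, "SELECT nme FROM employees"))                        # False: no such column
```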

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

BASE_MODEL = "codellama/CodeLlama-7b-hf"
ADAPTER = "your-username/codellama-7b-text2sql"  # this repo

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)

schema = "CREATE TABLE employees (id INTEGER, name TEXT, salary REAL, dept_id INTEGER);\nCREATE TABLE departments (id INTEGER, dept_name TEXT);"
question = "List all employees in the Engineering department ordered by salary descending."

prompt = (
    "<s>[INST] You are an expert SQL assistant. "
    "Given the database schema below, write a SQL query that answers the question.\n\n"
    f"### Schema:\n{schema}\n\n"
    f"### Question:\n{question} [/INST]\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding for deterministic SQL output.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
sql = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(sql)
```

## Training Details

### Training Data

- **Dataset:** [Spider](https://yale-lily.github.io/spider), a large-scale human-labeled Text-to-SQL dataset
- **Train split:** ~7,000 examples
- **Validation split:** ~1,034 examples
- **Filtering:** Examples exceeding 1024 tokens after prompt formatting were excluded

### Training Procedure

#### Preprocessing

Each example was formatted using an Alpaca-style instruction template with the database DDL schema and natural language question as input, and the SQL query as output. SQL queries were normalized by collapsing whitespace and stripping trailing semicolons.
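
A minimal sketch of this preprocessing (the template wording and helper names below are illustrative, not the verbatim training code); it also applies the 1024-token length filter described under Training Data:

```python
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
MAX_LEN = 1024  # examples longer than this after formatting are dropped

def normalize_sql(sql: str) -> str:
    # Collapse runs of whitespace and strip a trailing semicolon.
    return re.sub(r"\s+", " ", sql).strip().rstrip(";")

def format_example(schema: str, question: str, sql: str) -> str | None:
    # Alpaca-style instruction template: schema + question as input, SQL as output.
    text = (
        "Below is a question about a database. Write the SQL query that answers it.\n\n"
        f"### Schema:\n{schema}\n\n"
        f"### Question:\n{question}\n\n"
        f"### SQL:\n{normalize_sql(sql)}"
    )
    # Length filter: skip examples that exceed the maximum sequence length.
    if len(tokenizer(text)["input_ids"]) > MAX_LEN:
        return None
    return text
```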

#### Training Hyperparameters

- **Training regime:** fp16 mixed precision
- **Quantization:** 4-bit NF4 (bitsandbytes)
- **LoRA rank (r):** 16
- **LoRA alpha:** 32
- **LoRA dropout:** 0.05
- **Target modules:** `q_proj`, `v_proj`
- **Epochs:** 3
- **Batch size:** 4 per device with 4 gradient accumulation steps (effective batch size 16)
- **Learning rate:** 2e-4 with cosine schedule
- **Warmup ratio:** 0.03
- **Optimizer:** paged_adamw_32bit
- **Max sequence length:** 1024
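
Put together, these settings correspond roughly to the configuration sketched below using PEFT, bitsandbytes, and TRL as listed under Software. This is a hedged sketch, not the exact training script; argument names vary slightly across TRL versions, `train_dataset`/`eval_dataset` stand for the preformatted Spider splits (see Preprocessing), and the evaluation settings reflect the Metrics section:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

# 4-bit NF4 quantization of the frozen base model (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf", quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapters on the attention query/value projections.
peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

args = TrainingArguments(
    output_dir="codellama-7b-text2sql",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="paged_adamw_32bit",
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,   # assumed: formatted Spider train split
    eval_dataset=eval_dataset,     # assumed: formatted Spider validation split
    peft_config=peft_config,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=1024,
)
trainer.train()
```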

#### Speeds, Sizes, Times

- **Hardware:** Kaggle T4 x2 GPUs
- **Training time:** ~2.5 hours for 3 epochs
- **Adapter size:** ~84 MB

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Spider validation split (~1,034 examples across unseen databases).

#### Metrics

- **Eval loss** monitored during training via `evaluation_strategy="steps"` every 200 steps
- Best checkpoint selected by lowest eval loss

### Results

| Metric    | Value                 |
|-----------|-----------------------|
| Eval Loss | [Fill after training] |

#### Summary

The model learns to ground SQL generation in the provided schema, producing syntactically valid queries for common SQL patterns including SELECT, WHERE, JOIN, GROUP BY, and ORDER BY.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla T4 x2
- **Hours used:** ~2.5
- **Cloud Provider:** Kaggle (Google Cloud)
- **Compute Region:** us-central1
- **Carbon Emitted:** ~0.2 kg CO2eq (estimated via [ML Impact Calculator](https://mlco2.github.io/impact#compute))

## Technical Specifications

### Model Architecture and Objective

- **Base:** CodeLlama-7b (decoder-only transformer, 7B parameters)
- **Adapter:** LoRA applied to `q_proj` and `v_proj` attention layers
- **Objective:** Causal language modeling (next token prediction) on formatted SQL instruction examples

### Compute Infrastructure

#### Hardware

2x NVIDIA Tesla T4 (16 GB VRAM each) on Kaggle free tier

#### Software

- Python 3.12
- PyTorch 2.2.0
- Transformers 4.40.0
- PEFT 0.10.0
- TRL 0.9.6
- bitsandbytes 0.43.1
- Accelerate 0.29.3

## Citation

**BibTeX:**
```bibtex
@inproceedings{yu2018spider,
  title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  author    = {Yu, Tao and Zhang, Rui and Yang, Kai and Yasunaga, Michihiro and Wang, Dongxu and Li, Zifan and Ma, James and Li, Irene and Yao, Qingning and Roman, Shanelle and others},
  booktitle = {EMNLP},
  year      = {2018}
}
```

## Model Card Authors

[Your Name]

## Model Card Contact

[Your HuggingFace profile or email]