---
widget:
- text: >-
    sql_prompt: What is the monthly voice usage for each customer in the Mumbai
    region? sql_context: CREATE TABLE customers (customer_id INT, name
    VARCHAR(50), voice_usage_minutes FLOAT, region VARCHAR(50)); INSERT INTO
    customers (customer_id, name, voice_usage_minutes, region) VALUES (1, 'Aarav
    Patel', 500, 'Mumbai'), (2, 'Priya Shah', 700, 'Mumbai');
  example_title: Example1
- text: >-
    sql_prompt: How many wheelchair accessible vehicles are there in the 'Train'
    mode of transport? sql_context: CREATE TABLE Vehicles(vehicle_id INT,
    vehicle_type VARCHAR(20), mode_of_transport VARCHAR(20),
    is_wheelchair_accessible BOOLEAN); INSERT INTO Vehicles(vehicle_id,
    vehicle_type, mode_of_transport, is_wheelchair_accessible) VALUES (1,
    'Train_Car', 'Train', TRUE), (2, 'Train_Engine', 'Train', FALSE), (3, 'Bus',
    'Bus', TRUE);
  example_title: Example2
- text: >-
    sql_prompt: Which economic diversification efforts in the 'diversification'
    table have a higher budget than the average budget for all economic
    diversification efforts in the 'budget' table? sql_context: CREATE TABLE
    diversification (id INT, effort VARCHAR(50), budget FLOAT); CREATE TABLE
    budget (diversification_id INT, diversification_effort VARCHAR(50), amount
    FLOAT);
  example_title: Example3
language:
- en
datasets:
- gretelai/synthetic_text_to_sql
metrics:
- rouge
library_name: transformers
base_model: facebook/bart-large-cnn
model-index:
- name: SwastikM/bart-large-nl2sql
  results:
  - task:
      type: text2text-generation
    dataset:
      name: gretelai/synthetic_text_to_sql
      type: gretelai/synthetic_text_to_sql
      split: train, test
    metrics:
    - name: ROUGE-1
      type: rouge
      value: 55.69
      verified: true
    - name: ROUGE-2
      type: rouge
      value: 42.99
      verified: true
    - name: ROUGE-L
      type: rouge
      value: 51.43
      verified: true
    - name: ROUGE-LSUM
      type: rouge
      value: 51.4
      verified: true
    github: https://github.com/swastikmaiti/SwastikM-bart-large-nl2sql.git
co2_eq_emissions:
  emissions: 160
  source: ML CO2 Impact (https://mlco2.github.io/impact/#home)
  training_type: fine-tuning
  hardware_used: TESLA P100
tags:
- natural language
- SQL
- text2sql
- nl2sql
---

# BART-LARGE-CNN fine-tuned on SYNTHETIC_TEXT_TO_SQL

Generate a SQL query from a natural-language question and its SQL context.

## Model Details

### Model Description

BART, from facebook/bart-large-cnn, is fine-tuned on the gretelai/synthetic_text_to_sql dataset to generate SQL from a natural-language question and a SQL context.

- **Model type:** BART
- **Language(s) (NLP):** English
- **License:** openrail
- **Finetuned from model:** [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn)
- **Dataset:** [gretelai/synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)

## Intended uses & limitations

Demonstrates the power of an LLM fine-tuned on a downstream task. Implemented as a personal project.

### How to use

The model expects the natural-language question and its SQL context concatenated into a single input string:

```python
query_question_with_context = """sql_prompt: Which economic diversification efforts in
the 'diversification' table have a higher budget than the average budget for all economic diversification efforts in the 'budget' table?
sql_context: CREATE TABLE diversification (id INT, effort VARCHAR(50), budget FLOAT); CREATE TABLE
budget (diversification_id INT, diversification_effort VARCHAR(50), amount FLOAT);"""
```

#### Use a pipeline as a high-level helper

```python
from transformers import pipeline

sql_generator = pipeline("text2text-generation", model="SwastikM/bart-large-nl2sql")

sql = sql_generator(query_question_with_context)[0]['generated_text']

print(sql)
```

#### Load model directly

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("SwastikM/bart-large-nl2sql")
model = AutoModelForSeq2SeqLM.from_pretrained("SwastikM/bart-large-nl2sql")

inputs = tokenizer(query_question_with_context, return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(sql)
```

## Training Details

### Training Data

[gretelai/synthetic_text_to_sql](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql)

### Training Procedure

Fine-tuned with a custom training loop using Hugging Face Accelerate; a minimal end-to-end sketch follows the hyperparameters below.

#### Preprocessing

- ***Encoder Input:*** `"sql_prompt: " + data['sql_prompt'] + " sql_context: " + data['sql_context']`
- ***Decoder Input:*** `data['sql']` (see the sketch below)

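As a concrete illustration, here is a minimal per-record sketch of this preprocessing. The field names (`sql_prompt`, `sql_context`, `sql`) come from the dataset; the tokenizer mirrors the base model, while the `max_length` values are illustrative assumptions, not the exact training settings.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
data = load_dataset("gretelai/synthetic_text_to_sql", split="train")[0]

# Encoder input: the question concatenated with its schema context.
encoder_input = "sql_prompt: " + data['sql_prompt'] + " sql_context: " + data['sql_context']
# Decoder target: the gold SQL query.
decoder_target = data['sql']

# Tokenize both sides; max_length values are assumed for illustration.
model_inputs = tokenizer(encoder_input, max_length=512, truncation=True)
model_inputs["labels"] = tokenizer(
    text_target=decoder_target, max_length=256, truncation=True
)["input_ids"]
```
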
#### Training Hyperparameters

- **Optimizer:** AdamW
- **lr:** 2e-5
- **decay:** linear
- **num_warmup_steps:** 0
- **batch_size:** 8
- **num_training_steps:** 12500

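Putting these pieces together, the following is a minimal end-to-end sketch of an Accelerate training loop with the hyperparameters above. It is illustrative rather than the exact training script (see the GitHub repository for that); the batched `preprocess` helper and `max_length` values are assumptions.

```python
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, get_scheduler)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

def preprocess(batch):
    # Encoder input and labels built as described under Preprocessing.
    inputs = ["sql_prompt: " + p + " sql_context: " + c
              for p, c in zip(batch["sql_prompt"], batch["sql_context"])]
    enc = tokenizer(inputs, max_length=512, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["sql"],
                              max_length=256, truncation=True)["input_ids"]
    return enc

dataset = load_dataset("gretelai/synthetic_text_to_sql", split="train")
dataset = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

# batch_size=8 as listed above; the collator pads each batch dynamically.
collator = DataCollatorForSeq2Seq(tokenizer, model=model)
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collator)

# AdamW, lr 2e-5, linear decay, no warmup, 12500 steps, per this card.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
lr_scheduler = get_scheduler("linear", optimizer=optimizer,
                             num_warmup_steps=0, num_training_steps=12500)

accelerator = Accelerator()
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler)

model.train()
for step, batch in enumerate(train_dataloader):
    if step == 12500:
        break
    loss = model(**batch).loss
    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
```
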
#### Hardware

- **GPU:** P100

### Citing Dataset and BaseModel

```
@software{gretel-synthetic-text-to-sql-2024,
  author = {Meyer, Yev and Emadi, Marjan and Nathawani, Dhruv and Ramaswamy, Lipika and Boyd, Kendrick and Van Segbroeck, Maarten and Grossman, Matthew and Mlocek, Piotr and Newberry, Drew},
  title = {{Synthetic-Text-To-SQL}: A synthetic dataset for training language models to generate SQL queries from natural language prompts},
  month = {April},
  year = {2024},
  url = {https://huggingface.co/datasets/gretelai/synthetic-text-to-sql}
}
```

```
@article{DBLP:journals/corr/abs-1910-13461,
  author     = {Mike Lewis and
                Yinhan Liu and
                Naman Goyal and
                Marjan Ghazvininejad and
                Abdelrahman Mohamed and
                Omer Levy and
                Veselin Stoyanov and
                Luke Zettlemoyer},
  title      = {{BART:} Denoising Sequence-to-Sequence Pre-training for Natural Language
                Generation, Translation, and Comprehension},
  journal    = {CoRR},
  volume     = {abs/1910.13461},
  year       = {2019},
  url        = {http://arxiv.org/abs/1910.13461},
  eprinttype = {arXiv},
  eprint     = {1910.13461},
  timestamp  = {Thu, 31 Oct 2019 14:02:26 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-1910-13461.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```

## Additional Information

- ***Github:*** [Repository](https://github.com/swastikmaiti/SwastikM-bart-large-nl2sql.git)

## Acknowledgment

Thanks to [@AI at Meta](https://huggingface.co/facebook) for adding the pre-trained model.
Thanks to [@Gretel.ai](https://huggingface.co/gretelai) for adding the dataset.

## Model Card Authors

Swastik Maiti