---
language:
- en
license: apache-2.0
tags:
- text-to-sql
- sql
- tally
- accounting
- erp
- llama-3
- sqlcoder
- postgresql
- finance
base_model: defog/llama-3-sqlcoder-8b
library_name: peft
pipeline_tag: text-generation
---

# Tally SQLCoder - Fine-tuned for TallyPrime ERP

**Created by:** Jay Viramgami

A fine-tuned LLaMA 3 SQLCoder model specialized for converting natural language questions to PostgreSQL queries for **TallyPrime ERP** systems.

## Model Description

This model is specifically trained to understand the accounting and business terminology used in TallyPrime ERP and to generate accurate SQL queries for a PostgreSQL database migrated from Tally.

### Key Features

- **Accounting Domain Expertise** - Understands financial terms, GST, vouchers, ledgers
- **28 Database Tables** - Covers all master and transaction tables from Tally
- **ICAI Compliant** - Based on Indian accounting standards
- **Fast Inference** - Optimized with QLoRA for efficient deployment
- **High Accuracy** - Fine-tuned on 5,000+ Tally-specific query pairs

### Use Cases

- Customer receivables and vendor payables analysis
- Sales and purchase reporting
- Inventory and stock management queries
- GST and tax compliance reports
- Financial statements (Profit & Loss, Balance Sheet)
- Voucher and transaction searches

## Model Details

- **Base Model:** [defog/llama-3-sqlcoder-8b](https://huggingface.co/defog/llama-3-sqlcoder-8b)
- **Fine-tuning Method:** QLoRA (4-bit quantization)
- **Training Data:** 5,000 synthetic Tally accounting text-to-SQL pairs
- **Target Database:** PostgreSQL (Tally migration schema with 28 tables)
- **Training Platform:** Kaggle (NVIDIA T4 GPU)
- **Training Time:** ~4 hours
- **Final Training Loss:** 0.05-0.07

### Query Categories Supported

| Category | Templates | Examples |
|----------|-----------|----------|
| Simple Filters | 20 | "Show all customers", "List bank accounts" |
| Date Ranges | 20 | "Sales in March 2024", "Payments this quarter" |
| Aggregations | 25 | "Total sales amount", "Top 10 customers" |
| Joins | 25 | "Customer-wise outstanding", "Item sales by godown" |
| Accounting | 20 | "P&L items", "Assets and liabilities" |
| GST/Tax | 15 | "GST collected", "TDS deducted" |
| Inventory | 15 | "Stock movements", "Items with zero balance" |
| Financial Statements | 10 | "Trial balance", "Balance sheet data" |

## Quick Start

### Installation

```bash
pip install transformers peft torch accelerate bitsandbytes
```

### Basic Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "defog/llama-3-sqlcoder-8b",
    device_map="auto",
    torch_dtype=torch.float16
)

# Load fine-tuned adapter
model = PeftModel.from_pretrained(base_model, "jaykv/tally-sqlcoder-finetuned")
tokenizer = AutoTokenizer.from_pretrained("jaykv/tally-sqlcoder-finetuned")

# Generate SQL
question = "Show all customers with outstanding balance above 50000"
schema = """CREATE TABLE mst_ledger (
    name VARCHAR(1024),
    parent VARCHAR(1024),
    closing_balance DECIMAL(17,2)
);"""

prompt = f"""### Task
Generate a SQL query to answer [QUESTION]{question}[/QUESTION]

### Database Schema
The query will run on a database with the following schema:
{schema}

### Answer
Given the database schema, here is the SQL query that answers [QUESTION]{question}[/QUESTION]
[SQL]"""

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=300,
        temperature=0.1,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
sql = result.split("[SQL]")[-1].strip()
print(sql)
```

**Output:**
```sql
SELECT name, closing_balance
FROM mst_ledger
WHERE parent = 'Sundry Debtors'
  AND closing_balance > 50000
```

## Example Queries

### Simple Customer Query
**Question:** "Show all customers"
**Generated SQL:**
```sql
SELECT name FROM mst_ledger WHERE parent = 'Sundry Debtors'
```

### Sales Analysis
**Question:** "What is the total sales amount for March 2024?"
**Generated SQL:**
```sql
SELECT SUM(ABS(amount)) AS total_sales
FROM trn_accounting ta
JOIN trn_voucher tv ON ta.guid = tv.guid
WHERE tv.voucher_type = 'Sales'
  AND tv.date BETWEEN '2024-03-01' AND '2024-03-31'
```

### Top Customers
**Question:** "Show top 10 customers by sales"
**Generated SQL:**
```sql
SELECT tv.party_name, SUM(ABS(ta.amount)) AS total_sales
FROM trn_voucher tv
JOIN trn_accounting ta ON tv.guid = ta.guid
WHERE tv.voucher_type = 'Sales'
GROUP BY tv.party_name
ORDER BY total_sales DESC
LIMIT 10
```

### GST Query
**Question:** "Show GST payable amount"
**Generated SQL:**
```sql
SELECT name, closing_balance
FROM mst_ledger
WHERE parent = 'Duties & Taxes'
  AND name LIKE '%GST%'
```

## Database Schema

The model is trained on a PostgreSQL schema with **28 tables** from TallyPrime:

### Master Tables (15)
- `mst_ledger` - Customers, vendors, banks, expenses, incomes
- `mst_group` - Account group hierarchy
- `mst_stock_item` - Inventory items with GST details
- `mst_stock_group` - Stock categories
- `mst_vouchertype` - Voucher type definitions
- `mst_godown` - Warehouse locations
- `mst_cost_centre` - Cost centers
- And 8 more...

### Transaction Tables (13)
- `trn_voucher` - All financial transactions
- `trn_accounting` - Ledger-wise entries
- `trn_inventory` - Item-wise stock movements
- `trn_bill` - Bill allocations
- `trn_bank` - Bank transaction details
- And 8 more...

## Training Details

### Dataset
- **Size:** 5,000 text-to-SQL pairs
- **Source:** Synthetically generated using 150 query templates
- **Split:** 90/10 train/test
- **Categories:** 8 query types covering all Tally operations
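
As an illustration of the template approach, a pair generator can fill question/SQL templates with schema-specific slot values. The templates and slot lists below are hypothetical stand-ins; the actual 150 templates used for training are not published in this card.

```python
import random

# Hypothetical templates; the real 150 training templates are not part of this card.
TEMPLATES = [
    (
        "Show all ledgers under {group}",
        "SELECT name FROM mst_ledger WHERE parent = '{group}'",
    ),
    (
        "Total {vtype} amount",
        "SELECT SUM(ABS(amount)) FROM trn_accounting ta "
        "JOIN trn_voucher tv ON ta.guid = tv.guid "
        "WHERE tv.voucher_type = '{vtype}'",
    ),
]
GROUPS = ["Sundry Debtors", "Sundry Creditors", "Bank Accounts"]
VTYPES = ["Sales", "Purchase", "Payment"]

def make_pair(rng: random.Random) -> dict:
    """Fill one randomly chosen template with random slot values."""
    question_t, sql_t = rng.choice(TEMPLATES)
    slots = {"group": rng.choice(GROUPS), "vtype": rng.choice(VTYPES)}
    return {"question": question_t.format(**slots), "sql": sql_t.format(**slots)}

rng = random.Random(0)
pairs = [make_pair(rng) for _ in range(5)]
```

Scaling this to 150 templates and thousands of slot combinations yields a dataset of the size described above.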

### Training Configuration
- **Method:** QLoRA (Quantized Low-Rank Adaptation)
- **Quantization:** 4-bit (NF4)
- **LoRA Rank:** 16
- **LoRA Alpha:** 32
- **Target Modules:** q_proj, k_proj, v_proj, o_proj
- **Batch Size:** 2 per device
- **Gradient Accumulation:** 4 steps
- **Learning Rate:** 2e-4
- **Epochs:** ~3 (1,600 steps)
- **Optimizer:** PagedAdamW 8-bit
- **Max Sequence Length:** 2048 tokens
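
The configuration above corresponds roughly to the following `peft`/`bitsandbytes` setup. This is a sketch, not the exact training script: the dataset and trainer wiring are omitted, and `lora_dropout` is an assumed value not stated in this card.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization, matching the settings listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# LoRA adapter matching the stated rank, alpha, and target modules
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,  # assumption: dropout is not stated in this card
    task_type="CAUSAL_LM",
)
```

The base model would be loaded with `quantization_config=bnb_config` and wrapped with `get_peft_model(model, lora_config)` before training.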

### Hardware
- **Platform:** Kaggle Notebooks
- **GPU:** NVIDIA T4 (16 GB)
- **Training Time:** ~4 hours

## Performance

- **Valid SQL Syntax:** >95%
- **Keyword Match:** >85%
- **Exact Match (normalized):** >70%
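
"Exact Match (normalized)" means the generated and reference SQL agree after simple normalization. The exact normalization used in evaluation is not published, so the following is an assumed minimal version (lowercasing, whitespace collapsing, trailing-semicolon removal):

```python
import re

def normalize_sql(sql: str) -> str:
    """Lowercase, collapse whitespace, and drop a trailing semicolon."""
    sql = sql.strip().rstrip(";")
    sql = re.sub(r"\s+", " ", sql)
    return sql.lower()

def exact_match(generated: str, reference: str) -> bool:
    return normalize_sql(generated) == normalize_sql(reference)

print(exact_match(
    "SELECT name\nFROM mst_ledger;",
    "select name from mst_ledger",
))  # True under this normalization
```

Note that this treats semantically equivalent but differently written queries (e.g. reordered columns) as mismatches, which is why the keyword-match figure is higher.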

## Limitations

- **Tally-Specific:** Optimized for the TallyPrime PostgreSQL schema
- **PostgreSQL Only:** SQL is generated in the PostgreSQL dialect
- **Schema Required:** Needs the database schema in the prompt
- **Context Window:** Limited to 2048 tokens
- **Custom Schemas:** May require additional fine-tuning for non-Tally schemas

## Deployment Tips

### For Production Use
1. **Add validation** - Verify generated SQL before execution
2. **Read-only mode** - Restrict to SELECT queries only
3. **Query timeout** - Set execution time limits
4. **Error handling** - Catch and handle syntax errors
5. **Logging** - Track all queries for audit
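
For the validation and read-only points, a minimal guard can reject anything that is not a single SELECT statement. This is a sketch only; a real deployment should also use a read-only database role and parser-based validation rather than keyword filtering alone.

```python
import re

# Keywords that should never appear in a read-only query
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    """Accept a single SELECT statement; reject writes and stacked queries."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # stacked statements like "SELECT 1; DELETE ..."
        return False
    if not stripped.lower().startswith("select"):
        return False
    return not FORBIDDEN.search(stripped)

print(is_safe_select("SELECT name FROM mst_ledger"))        # True
print(is_safe_select("DROP TABLE mst_ledger"))              # False
print(is_safe_select("SELECT 1; DELETE FROM trn_voucher"))  # False
```

Query timeouts can then be enforced at the database level, e.g. PostgreSQL's `statement_timeout` setting.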

### Optimization
- Use a GPU for faster inference (2-3 seconds per query)
- CPU inference works but is slower (~10-15 seconds per query)
- Consider caching frequently asked queries
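
Caching can be as simple as memoizing on the (question, schema) pair with the standard library. In this sketch, `generate_sql` is a placeholder for the actual model call shown in Basic Usage above.

```python
from functools import lru_cache

def generate_sql(question: str, schema: str) -> str:
    """Placeholder for the real model inference call."""
    return f"-- SQL for: {question}"

@lru_cache(maxsize=1024)
def cached_generate_sql(question: str, schema: str) -> str:
    # Identical (question, schema) pairs skip model inference entirely.
    return generate_sql(question, schema)

first = cached_generate_sql("Show all customers", "CREATE TABLE mst_ledger (...)")
second = cached_generate_sql("Show all customers", "CREATE TABLE mst_ledger (...)")
```

For repeated dashboard-style queries, this avoids the 2-15 second inference cost on every request.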

## License

This model is released under the **Apache 2.0** license, inherited from the base model.

## Acknowledgments

- **Base Model:** [defog/llama-3-sqlcoder-8b](https://huggingface.co/defog/llama-3-sqlcoder-8b) by Defog.ai
- **Training Standards:** ICAI (Institute of Chartered Accountants of India) Foundation Course
- **Platform:** Trained on Kaggle's free GPU infrastructure

## Contact

**Author:** Jay Viramgami

For questions, feedback, or collaboration inquiries, please open an issue on the model's discussion page.

## Related Resources

- [TallyPrime ERP](https://tallysolutions.com/)
- [Defog SQLCoder](https://github.com/defog-ai/sqlcoder)
- [PEFT Library](https://github.com/huggingface/peft)

---

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{tally-sqlcoder-finetuned,
  author = {Jay Viramgami},
  title = {Tally SQLCoder - Fine-tuned for TallyPrime ERP},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/jaykv/tally-sqlcoder-finetuned}
}
```

---

**Model Card created by Jay Viramgami | March 2024**