---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- sql
- forensics
- text-to-sql
- llama
- fine-tuned
base_model: unsloth/Llama-3.2-3B-Instruct
datasets:
- pawlaszc/mobile-forensics-sql
metrics:
- accuracy
model-index:
- name: ForensicSQL-Llama-3.2-3B
  results:
  - task:
      type: text-to-sql
      name: Text-to-SQL Generation
    dataset:
      type: mobile-forensics
      name: SQLiteDS — Mobile Forensics SQL Dataset (corrected)
    metrics:
    - type: accuracy
      value: 91.0
      name: Overall Accuracy (without app name)
    - type: accuracy
      value: 95.1
      name: Easy Queries Accuracy
    - type: accuracy
      value: 87.5
      name: Medium Queries Accuracy
    - type: accuracy
      value: 88.9
      name: Hard Queries Accuracy
---

# ForensicSQL-Llama-3.2-3B

## Model Description

**ForSQLiteLM** (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized in generating SQLite queries from natural-language requests against mobile forensic databases. The model converts investigative questions into executable SQL queries across a wide range of forensic artifact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat, Instagram, blockchain wallets, and many more.

This model was developed as part of a research project and accompanying journal paper investigating LLM fine-tuning for forensic database analysis, and is integrated into [FQLite](https://github.com/pawlaszczyk/fqlite), an established open-source forensic analysis tool.

> **Key result:** 91.0% execution accuracy on a 100-example held-out test set — within
> 4 percentage points of GPT-4o (95.0%) evaluated under identical conditions
> (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally
> with no internet connectivity required.
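The Wilson score intervals quoted throughout this card can be reproduced with a few lines of standard-library Python. This is a minimal sketch for readers who want to check the numbers, not code from the project's evaluation pipeline; the function name is illustrative.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 → 95% CI)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Overall result: 91/100 correct
lo, hi = wilson_ci(91, 100)
print(f"[{lo:.1%}, {hi:.1%}]")  # → [83.8%, 95.2%]
```

The same function reproduces the per-difficulty intervals below, e.g. `wilson_ci(39, 41)` gives the [83.9%, 98.7%] interval reported for Easy queries.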
## Model Details

| Property | Value |
|---|---|
| **Base Model** | meta-llama/Llama-3.2-3B-Instruct |
| **Fine-tuning Method** | Full fine-tune (bf16) |
| **Training Dataset** | SQLiteDS — 800 training examples, 191 forensic artifact categories |
| **Training Framework** | Hugging Face Transformers |
| **Best Val Loss** | 0.3043 (7 epochs) |
| **Model Size (bf16)** | ~6 GB |
| **Hardware Required** | 16 GB unified memory (Apple M-series) or equivalent GPU |

## Performance

### Overall Results (fixed dataset, n=100, best configuration)

| Metric | Value |
|---|---|
| **Overall Accuracy** | **91.0%** (91/100) |
| 95% CI (Wilson) | [83.8%, 95.2%] |
| Executable Queries | 92/100 |
| GPT-4o Accuracy | 95.0% (gap: 4 pp, p ≈ 0.39) |
| Base Model (no fine-tuning) | 35.0% |
| Improvement over base | +56 pp |

### Accuracy by Query Difficulty

| Difficulty | Accuracy | n | 95% CI | vs. GPT-4o |
|---|---|---|---|---|
| Easy (single-table) | **95.1%** | 39/41 | [83.9%, 98.7%] | 0.0 pp |
| Medium (joins, aggregation) | **87.5%** | 28/32 | [71.9%, 95.0%] | 0.0 pp |
| Hard (CTEs, window functions) | **88.9%** | 24/27 | [71.9%, 96.1%] | −3.7 pp |

ForSQLiteLM matches GPT-4o exactly on Easy and Medium queries. The remaining gap is concentrated on Hard queries (complex CTEs, window functions, multi-table joins).
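All accuracy figures on this card use execution accuracy: a generated query counts as correct only if it runs without error and returns the same result set as the reference query on the same database. A minimal sketch of such a check against an in-memory SQLite database follows; the helper name, toy schema, and data are illustrative, not taken from the actual test set.

```python
import sqlite3

def execution_match(schema_sql: str, generated: str, reference: str) -> bool:
    """True iff `generated` executes without error and returns the same
    result set as `reference` on a fresh in-memory database."""
    con = sqlite3.connect(":memory:")
    con.executescript(schema_sql)  # create tables and seed rows
    try:
        got = con.execute(generated).fetchall()
    except sqlite3.Error:
        return False  # non-executable queries count as wrong
    want = con.execute(reference).fetchall()
    return got == want

schema = """
CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER);
INSERT INTO sms VALUES (1, '+491701234567', 'hello', 1700000000),
                       (2, '+491707654321', 'bye',   1700100000);
"""

# Semantically equivalent filters match; a misspelled column does not execute.
print(execution_match(schema,
                      "SELECT body FROM sms WHERE date > 1700050000",
                      "SELECT body FROM sms WHERE date >= 1700050001"))  # → True
```

The project's actual evaluation script (available in the dataset repository) may differ in details, e.g. how row ordering is handled for queries without an `ORDER BY`.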
### Accuracy by Forensic Domain

| Domain | Accuracy | n | 95% CI |
|---|---|---|---|
| Messaging & Social | **100.0%** | 28/28 | [87.9%, 100.0%] |
| Android Artifacts | **94.4%** | 17/18 | [74.2%, 99.0%] |
| Productivity & Other | **88.9%** | 16/18 | [67.2%, 96.9%] |
| iOS CoreData | **84.0%** | 21/25 | [65.3%, 93.6%] |
| Finance & Crypto | **81.8%** | 9/11 | [52.3%, 94.9%] |

### Prompt Configuration Ablation

| Configuration | Overall | Easy | Medium | Hard | iOS |
|---|---|---|---|---|---|
| **WITHOUT App Name** ★ | **91.0%** | **95.1%** | 87.5% | **88.9%** | 84.0% |
| WITH App Name | 88.0% | 92.7% | 87.5% | 81.5% | **88.0%** |

★ Primary configuration — omitting the application name from the prompt yields 3 pp higher overall accuracy. Interestingly, including the app name helps iOS CoreData schemas (+4 pp) but hurts Hard queries (−7.4 pp); the primary configuration without app name is recommended for general use.

### Post-Processing Pipeline Contribution

| Component | Queries saved |
|---|---|
| Execution feedback (retry) | 7 |
| Alias normalization | 18 |
| Column corrections (Levenshtein) | 2 |

### Training Progression

| Configuration | Val Loss | Accuracy | Δ |
|---|---|---|---|
| Base model (no fine-tuning) | — | 35.0% | — |
| Fine-tuned, no augmentation | — | 68.0% | +33 pp |
| + Data augmentation (3.4×) | — | 74.0% | +6 pp |
| + Extended training (7 epochs) | 0.3617 | 84.0% | +10 pp |
| + Post-processing pipeline | 0.3617 | 87.0% | +3 pp |
| + Execution feedback | 0.3617 | 90.0% | +3 pp |
| + Corrected training dataset (v5) | **0.3043** | **91.0%** | +1 pp |

## Intended Use

### Primary Use Cases

- Mobile forensics investigations: automated SQL query drafting against seized device databases
- Integration into forensic tools (FQLite, Autopsy, ALEAPP/iLEAPP workflows)
- Research in domain-specific Text-to-SQL
- Educational use for learning forensic database analysis

### Important: This Model is a Drafting Assistant

> **ForSQLiteLM is not a replacement for SQL
> expertise.** It generates candidate queries that require review by a practitioner
> with sufficient SQL knowledge before any reliance is placed on their results.
> The 91.0% accuracy means approximately **1 in 11 queries contains an error**.
> In court-admissible or case-critical work, all outputs must be independently validated.

### Out-of-Scope Use

- Autonomous forensic decision-making without human review
- Production systems requiring >95% guaranteed accuracy
- General-purpose SQL generation outside the forensic domain
- Non-SQLite databases (PostgreSQL, MySQL, etc.)

## How to Use

### Quick Start (Transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model.eval()

schema = """
CREATE TABLE message (
    ROWID INTEGER PRIMARY KEY,
    text TEXT,
    handle_id INTEGER,
    date INTEGER,
    is_from_me INTEGER,
    cache_has_attachments INTEGER
);
CREATE TABLE handle (
    ROWID INTEGER PRIMARY KEY,
    id TEXT,
    service TEXT
);
"""

request = "Find all messages received in the last 7 days that contain attachments"

# Note: do NOT use apply_chat_template — use a plain-text prompt
prompt = f"""Generate a valid SQLite query for this forensic database request.

Database Schema:
{schema}

Request: {request}

SQLite Query:
"""

inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=False,  # greedy decoding — do not change
    )

input_length = inputs["input_ids"].shape[1]
sql = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
print(sql.strip())
```

> **Important:** Use plain-text tokenization (do **not** call `apply_chat_template`).
> The model was trained and evaluated with a plain-text prompt format.
> Use `do_sample=False` (greedy decoding) for reproducible results.

### Python Helper Class

```python
class ForensicSQLGenerator:
    def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
        self.model.eval()

    def generate_sql(self, schema: str, request: str) -> str:
        import torch  # needed here too; the import in __init__ is function-local
        prompt = (
            "Generate a valid SQLite query for this forensic database request.\n\n"
            f"Database Schema:\n{schema}\n\n"
            f"Request: {request}\n\n"
            "SQLite Query:\n"
        )
        inputs = self.tokenizer(
            prompt, return_tensors="pt", truncation=True, max_length=2048
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        input_length = inputs["input_ids"].shape[1]
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs, max_new_tokens=300, do_sample=False
            )
        sql = self.tokenizer.decode(
            outputs[0][input_length:], skip_special_tokens=True
        )
        # Return first statement only, normalized
        return sql.strip().split("\n")[0].strip().rstrip(";") + ";"

# Usage
generator = ForensicSQLGenerator()
sql = generator.generate_sql(schema, "Find all unread messages from the last 24 hours")
print(sql)
```

### With Ollama / llama.cpp (GGUF)

```bash
# With llama.cpp
./llama-cli -m forensic-sql-q4_k_m.gguf \
    --temp 0 \
    -p "Generate a valid SQLite query for this forensic database request.
Database Schema:
CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER);

Request: Find all messages sent after midnight

SQLite Query:"

# With Ollama — create a Modelfile
cat > Modelfile << 'EOF'
FROM ./forensic-sql-q4_k_m.gguf
PARAMETER temperature 0
PARAMETER num_predict 300
EOF

ollama create forensic-sql -f Modelfile
ollama run forensic-sql
```

## Training Details

### Dataset — SQLiteDS

- **Total examples:** 1,000 (800 train / 100 val / 100 test), fixed random seed 42
- **Forensic artifact categories:** 191
- **Reference query validation:** All 1,000 reference queries validated for execution correctness against in-memory SQLite; 50 queries (5%) corrected before final training
- **Augmentation:** 3.4× expansion via instruction paraphrasing, WHERE clause reordering, and LIMIT injection — augmented examples confined to the training split only
- **Dataset:** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
- **License:** CC BY 4.0

### Hyperparameters

| Parameter | Value |
|---|---|
| Training method | Full fine-tune (no LoRA) |
| Precision | bfloat16 |
| Epochs | 7 |
| Learning rate | 2e-5 (peak) |
| LR scheduler | Cosine with warmup |
| Batch size | 1 + gradient accumulation 4 |
| Max sequence length | 2048 |
| Optimizer | AdamW |
| Hardware | Apple M-series, 16 GB unified memory |
| Training time | ~17.6 hours |
| Best val loss | 0.3043 (epoch 7) |

### Key Training Insight: Sequence Length

Early training runs with `max_seq_length=512` truncated 92% of examples, causing the model to learn schema generation (CREATE TABLE) instead of queries — resulting in only ~50% accuracy. Setting `max_seq_length=2048` eliminated truncation and improved accuracy from 50% to 68% before augmentation, and to 91% after all training components were applied.

## Limitations

### Known Issues

1. **iOS CoreData Schemas (84.0%):** The Z-prefix column naming convention (e.g., `ZISFROMME`, `ZTIMESTAMP`) provides no semantic signal from column names alone, making these schemas harder to reason about.
2. **Hard Queries — 3.7 pp gap to GPT-4o:** Complex CTEs, recursive queries, and window functions are the primary remaining challenge.
3. **Finance & Crypto (81.8%, n=11):** Small test set; confidence intervals are wide. Interpret with caution.
4. **~1 in 11 error rate:** Approximately 9% of generated queries will contain errors. Expert review of all outputs is required before use in investigations.

### When Human Review is Especially Important

- Complex multi-table queries with CTEs or window functions
- Case-critical or court-admissible investigations
- Any query that will be used to draw conclusions about a suspect
- Queries involving rare or unusual forensic artifact schemas

## Evaluation

- **Test set:** 100 examples, held-out, seed=42, non-augmented
- **Metric:** Execution accuracy — a query is correct iff it executes without error AND returns a result set identical to the reference query
- **Reference validation:** All reference queries validated for execution correctness before evaluation; 5 broken queries in the test set were corrected
- **Evaluation script:** Available in the dataset repository on Zenodo ([DOI])

## Citation

If you use this model or the SQLiteDS dataset in your research, please cite:

```bibtex
@article{pawlaszczyk2026forsqlitelm,
  author  = {Dirk Pawlaszczyk},
  title   = {AI-Based Automated SQL Query Generation for SQLite Databases in Mobile Forensics},
  journal = {Forensic Science International: Digital Investigation},
  year    = {2026},
  note    = {FSIDI-D-26-00029}
}
```

## License

Apache 2.0 — following the base Llama 3.2 license terms.
## Acknowledgments

- Base model: Meta's Llama 3.2-3B-Instruct
- Training framework: Hugging Face Transformers
- Forensic tool integration: [FQLite](https://github.com/pawlaszczyk/fqlite)
- Schema sources: iLEAPP, ALEAPP, Autopsy (used under their respective open-source licenses)

## Additional Resources

- **Dataset (Zenodo):** [SQLiteDS — DOI to be added on publication]
- **Dataset (HuggingFace):** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
- **FQLite integration:** [github.com/pawlaszczyk/fqlite](https://github.com/pawlaszczyk/fqlite)
- **Paper:** FSIDI-D-26-00029 (under review)

---

**Disclaimer:** ForSQLiteLM is intended for research and forensic practitioner use. All generated SQL queries must be reviewed by a qualified practitioner before execution in live forensic investigations. The authors accept no liability for incorrect conclusions drawn from unvalidated model outputs.