| | --- |
| | language: |
| | - en |
| | license: apache-2.0 |
| | library_name: transformers |
| | tags: |
| | - sql |
| | - forensics |
| | - text-to-sql |
| | - llama |
| | - fine-tuned |
| | base_model: unsloth/Llama-3.2-3B-Instruct |
| | datasets: |
| | - pawlaszc/mobile-forensics-sql |
| | metrics: |
| | - accuracy |
| | model-index: |
| | - name: ForensicSQL-Llama-3.2-3B |
| | results: |
| | - task: |
| | type: text-to-sql |
| | name: Text-to-SQL Generation |
| | dataset: |
| | type: mobile-forensics |
| | name: SQLiteDS — Mobile Forensics SQL Dataset (corrected) |
| | metrics: |
| | - type: accuracy |
| | value: 91.0 |
| | name: Overall Accuracy (without app name) |
| | - type: accuracy |
| | value: 95.1 |
| | name: Easy Queries Accuracy |
| | - type: accuracy |
| | value: 87.5 |
| | name: Medium Queries Accuracy |
| | - type: accuracy |
| | value: 88.9 |
| | name: Hard Queries Accuracy |
| | --- |
| | |
| | # ForensicSQL-Llama-3.2-3B |
| |
|
| | ## Model Description |
| |
|
| | **ForSQLiteLM** (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized |
| | for generating SQLite queries from natural language requests against mobile forensic databases. |
| | The model converts investigative questions into executable SQL queries across a wide range of |
| | forensic artifact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat, |
| | Instagram, blockchain wallets, and many more. |
| |
|
| | This model was developed as part of a research project and accompanying journal paper |
| | investigating LLM fine-tuning for forensic database analysis, and is integrated into |
| | [FQLite](https://github.com/pawlaszczyk/fqlite), an established open-source forensic |
| | analysis tool. |
| |
|
| | > **Key result:** 91.0% execution accuracy on a 100-example held-out test set — within |
| | > 4 percentage points of GPT-4o (95.0%) evaluated under identical conditions |
| | > (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally |
| | > with no internet connectivity required. |
| |
|
| | ## Model Details |
| |
|
| | | Property | Value | |
| | |---|---| |
| | | **Base Model** | meta-llama/Llama-3.2-3B-Instruct | |
| | | **Fine-tuning Method** | Full fine-tune (bf16) | |
| | | **Training Dataset** | SQLiteDS — 800 training examples, 191 forensic artifact categories | |
| | | **Training Framework** | Hugging Face Transformers | |
| | | **Best Val Loss** | 0.3043 (7 epochs) | |
| | | **Model Size (bf16)** | ~6 GB | |
| | | **Hardware Required** | 16 GB unified memory (Apple M-series) or equivalent GPU | |
| |
|
| | ## Performance |
| |
|
| | ### Overall Results (fixed dataset, n=100, best configuration) |
| |
|
| | | Metric | Value | |
| | |---|---| |
| | | **Overall Accuracy** | **91.0%** (91/100) | |
| | | 95% CI (Wilson) | [83.8%, 95.2%] | |
| | | Executable Queries | 92/100 | |
| | | GPT-4o Accuracy | 95.0% (gap: 4 pp, p ≈ 0.39) | |
| | | Base Model (no fine-tuning) | 35.0% | |
| | | Improvement over base | +56 pp | |
| |
|
| | ### Accuracy by Query Difficulty |
| |
|
| | | Difficulty | Accuracy | n | 95% CI | vs. GPT-4o | |
| | |---|---|---|---|---| |
| | | Easy (single-table) | **95.1%** | 39/41 | [83.9%, 98.7%] | 0.0 pp | |
| | | Medium (joins, aggregation) | **87.5%** | 28/32 | [71.9%, 95.0%] | 0.0 pp | |
| | | Hard (CTEs, window functions) | **88.9%** | 24/27 | [71.9%, 96.1%] | −3.7 pp | |
| |
|
| | ForSQLiteLM matches GPT-4o exactly on Easy and Medium queries. The remaining gap |
| | is concentrated on Hard queries (complex CTEs, window functions, multi-table joins). |
| |
|
| | ### Accuracy by Forensic Domain |
| |
|
| | | Domain | Accuracy | n | 95% CI | |
| | |---|---|---|---| |
| | | Messaging & Social | **100.0%** | 28/28 | [87.9%, 100.0%] | |
| | | Android Artifacts | **94.4%** | 17/18 | [74.2%, 99.0%] | |
| | | Productivity & Other | **88.9%** | 16/18 | [67.2%, 96.9%] | |
| | | iOS CoreData | **84.0%** | 21/25 | [65.3%, 93.6%] | |
| | | Finance & Crypto | **81.8%** | 9/11 | [52.3%, 94.9%] | |
| |
|
| | ### Prompt Configuration Ablation |
| |
|
| | | Configuration | Overall | Easy | Medium | Hard | iOS | |
| | |---|---|---|---|---|---| |
| | | **WITHOUT App Name** ★ | **91.0%** | **95.1%** | 87.5% | **88.9%** | 84.0% | |
| | | WITH App Name | 88.0% | 92.7% | 87.5% | 81.5% | **88.0%** | |
| |
|
| | ★ Primary configuration — omitting the application name from the prompt yields |
| | 3 pp higher overall accuracy. Interestingly, including the app name helps iOS |
| | CoreData schemas (+4 pp) but hurts Hard queries (−7.4 pp); the primary |
| | configuration without app name is recommended for general use. |
| |
|
| | ### Post-Processing Pipeline Contribution |
| |
|
| | | Component | Queries saved | |
| | |---|---| |
| | | Execution feedback (retry) | 7 | |
| | | Alias normalization | 18 | |
| | | Column corrections (Levenshtein) | 2 | |
| |
|
| | ### Training Progression |
| |
|
| | | Configuration | Val Loss | Accuracy | Δ | |
| | |---|---|---|---| |
| | | Base model (no fine-tuning) | — | 35.0% | — | |
| | | Fine-tuned, no augmentation | — | 68.0% | +33 pp | |
| | | + Data augmentation (3.4×) | — | 74.0% | +6 pp | |
| | | + Extended training (7 epochs) | 0.3617 | 84.0% | +10 pp | |
| | | + Post-processing pipeline | 0.3617 | 87.0% | +3 pp | |
| | | + Execution feedback | 0.3617 | 90.0% | +3 pp | |
| | | + Corrected training dataset (v5) | **0.3043** | **91.0%** | +1 pp | |
| |
|
| | ## Intended Use |
| |
|
| | ### Primary Use Cases |
| | - Mobile forensics investigations: automated SQL query drafting against seized device databases |
| | - Integration into forensic tools (FQLite, Autopsy, ALEAPP/iLEAPP workflows) |
| | - Research in domain-specific Text-to-SQL |
| | - Educational use for learning forensic database analysis |
| |
|
| | ### Important: This Model is a Drafting Assistant |
| |
|
| | > **ForSQLiteLM is not a replacement for SQL expertise.** It generates candidate queries |
| | > that require review by a practitioner with sufficient SQL knowledge before any reliance |
| | > is placed on their results. The 91.0% accuracy means approximately **1 in 11 queries |
| | > contains an error**. In court-admissible or case-critical work, all outputs must be |
| | > independently validated. |
| |
|
| | ### Out-of-Scope Use |
| | - Autonomous forensic decision-making without human review |
| | - Production systems requiring >95% guaranteed accuracy |
| | - General-purpose SQL generation outside the forensic domain |
| | - Non-SQLite databases (PostgreSQL, MySQL, etc.) |
| |
|
| | ## How to Use |
| |
|
| | ### Quick Start (Transformers) |
| |
|
| | ```python |
| | from transformers import AutoModelForCausalLM, AutoTokenizer |
| | import torch |
| | |
| | model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B" |
| | tokenizer = AutoTokenizer.from_pretrained(model_name) |
| | model = AutoModelForCausalLM.from_pretrained( |
| | model_name, |
| | torch_dtype=torch.bfloat16, |
| | device_map="auto" |
| | ) |
| | model.eval() |
| | |
| | schema = """ |
| | CREATE TABLE message ( |
| | ROWID INTEGER PRIMARY KEY, |
| | text TEXT, |
| | handle_id INTEGER, |
| | date INTEGER, |
| | is_from_me INTEGER, |
| | cache_has_attachments INTEGER |
| | ); |
| | CREATE TABLE handle ( |
| | ROWID INTEGER PRIMARY KEY, |
| | id TEXT, |
| | service TEXT |
| | ); |
| | """ |
| | |
| | request = "Find all messages received in the last 7 days that contain attachments" |
| | |
| | # Note: do NOT use apply_chat_template — use plain-text prompt |
| | prompt = f"""Generate a valid SQLite query for this forensic database request. |
| | |
| | Database Schema: |
| | {schema} |
| | |
| | Request: {request} |
| | |
| | SQLite Query: |
| | """ |
| | |
| | inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048) |
| | inputs = {k: v.to(model.device) for k, v in inputs.items()} |
| | |
| | with torch.no_grad(): |
| | outputs = model.generate( |
| | **inputs, |
| | max_new_tokens=300, |
| | do_sample=False, # greedy decoding — do not change |
| | ) |
| | |
| | input_length = inputs['input_ids'].shape[1] |
| | sql = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True) |
| | print(sql.strip()) |
| | ``` |
| |
|
| | > **Important:** Use plain-text tokenization (do **not** call `apply_chat_template`). |
| | > The model was trained and evaluated with a plain-text prompt format. |
| | > Use `do_sample=False` (greedy decoding) for reproducible results. |
| | |
| | ### Python Helper Class |
| | |
| | ```python |
| | class ForensicSQLGenerator: |
| | def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"): |
| | from transformers import AutoModelForCausalLM, AutoTokenizer |
| | import torch |
| | |
| | self.tokenizer = AutoTokenizer.from_pretrained(model_name) |
| | self.model = AutoModelForCausalLM.from_pretrained( |
| | model_name, |
| | torch_dtype=torch.bfloat16, |
| | device_map="auto" |
| | ) |
| | self.model.eval() |
| | |
| | def generate_sql(self, schema: str, request: str) -> str: |
| | prompt = ( |
| | "Generate a valid SQLite query for this forensic database request.\n\n" |
| | f"Database Schema:\n{schema}\n\n" |
| | f"Request: {request}\n\n" |
| | "SQLite Query:\n" |
| | ) |
| | inputs = self.tokenizer( |
| | prompt, return_tensors="pt", truncation=True, max_length=2048 |
| | ) |
| | inputs = {k: v.to(self.model.device) for k, v in inputs.items()} |
| | input_length = inputs["input_ids"].shape[1] |
| | |
| | with torch.no_grad(): |
| | outputs = self.model.generate( |
| | **inputs, max_new_tokens=300, do_sample=False |
| | ) |
| | |
| | sql = self.tokenizer.decode( |
| | outputs[0][input_length:], skip_special_tokens=True |
| | ) |
| | # Return first statement only, normalized |
| | return sql.strip().split("\n")[0].strip().rstrip(";") + ";" |
| | |
| |
|
| | # Usage |
| | generator = ForensicSQLGenerator() |
| | sql = generator.generate_sql(schema, "Find all unread messages from the last 24 hours") |
| | print(sql) |
| | ``` |
| | |
| | ### With Ollama / llama.cpp (GGUF) |
| | |
| | ```bash |
| | # With llama.cpp |
| | ./llama-cli -m forensic-sql-q4_k_m.gguf \ |
| | --temp 0 \ |
| | -p "Generate a valid SQLite query for this forensic database request. |
| | |
| | Database Schema: |
| | CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER); |
| |
|
| | Request: Find all messages sent after midnight |
| |
|
| | SQLite Query:" |
| |
|
| | # With Ollama — create a Modelfile |
| | cat > Modelfile << 'EOF' |
| | FROM ./forensic-sql-q4_k_m.gguf |
| | PARAMETER temperature 0 |
| | PARAMETER num_predict 300 |
| | EOF |
| | |
| | ollama create forensic-sql -f Modelfile |
| | ollama run forensic-sql |
| | ``` |
| | |
| | ## Training Details |
| | |
| | ### Dataset — SQLiteDS |
| | |
| | - **Total examples:** 1,000 (800 train / 100 val / 100 test), fixed random seed 42 |
| | - **Forensic artifact categories:** 191 |
| | - **Reference query validation:** All 1,000 reference queries validated for execution |
| | correctness against in-memory SQLite; 50 queries (5%) corrected before final training |
| | - **Augmentation:** 3.4× expansion via instruction paraphrasing, WHERE clause reordering, |
| | and LIMIT injection — augmented examples confined to training split only |
| | - **Dataset:** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql) |
| | - **License:** CC BY 4.0 |
| | |
| | ### Hyperparameters |
| | |
| | | Parameter | Value | |
| | |---|---| |
| | | Training method | Full fine-tune (no LoRA) | |
| | | Precision | bfloat16 | |
| | | Epochs | 7 | |
| | | Learning rate | 2e-5 (peak) | |
| | | LR scheduler | Cosine with warmup | |
| | | Batch size | 1 + gradient accumulation 4 | |
| | | Max sequence length | 2048 | |
| | | Optimizer | AdamW | |
| | | Hardware | Apple M-series, 16 GB unified memory | |
| | | Training time | ~17.6 hours | |
| | | Best val loss | 0.3043 (epoch 7) | |
| | |
| | ### Key Training Insight: Sequence Length |
| | |
| | Early training runs with `max_seq_length=512` truncated 92% of examples, causing |
| | the model to learn schema generation (CREATE TABLE) instead of queries — resulting |
| | in only ~50% accuracy. Setting `max_seq_length=2048` eliminated truncation and |
| | improved accuracy from 50% to 68% before augmentation, and to 91% after all |
| | training components were applied. |
| | |
| | ## Limitations |
| | |
| | ### Known Issues |
| | |
| | 1. **iOS CoreData Schemas (84.0%):** The Z-prefix column naming convention |
| | (e.g., `ZISFROMME`, `ZTIMESTAMP`) provides no semantic signal from column |
| | names alone, making these schemas harder to reason about. |
| | 2. **Hard Queries — 3.7 pp gap to GPT-4o:** Complex CTEs, recursive queries, |
| | and window functions are the primary remaining challenge. |
| | 3. **Finance & Crypto (81.8%, n=11):** Small test set; confidence intervals are |
| | wide. Interpret with caution. |
| | 4. **~1 in 11 error rate:** Approximately 9% of generated queries will contain |
| | errors. Expert review of all outputs is required before use in investigations. |
| | |
| | ### When Human Review is Especially Important |
| | - Complex multi-table queries with CTEs or window functions |
| | - Case-critical or court-admissible investigations |
| | - Any query that will be used to draw conclusions about a suspect |
| | - Queries involving rare or unusual forensic artifact schemas |
| | |
| | ## Evaluation |
| | |
| | - **Test set:** 100 examples, held-out, seed=42, non-augmented |
| | - **Metric:** Execution accuracy — query is correct iff it executes without error |
| | AND returns a result set identical to the reference query |
| | - **Reference validation:** All reference queries validated for execution correctness |
| | before evaluation; 5 broken queries in the test set were corrected |
| | - **Evaluation script:** Available in the dataset repository on Zenodo ([DOI]) |
| | |
| | ## Citation |
| | |
| | If you use this model or the SQLiteDS dataset in your research, please cite: |
| | |
| | ```bibtex |
| | @article{pawlaszczyk2026forsqlitelm, |
| | author = {Dirk Pawlaszczyk}, |
| | title = {AI-Based Automated SQL Query Generation for SQLite Databases |
| | in Mobile Forensics}, |
| | journal = {Forensic Science International: Digital Investigation}, |
| | year = {2026}, |
| | note = {FSIDI-D-26-00029} |
| | } |
| | ``` |
| | |
| | ## License |
| | |
| | Apache 2.0 — following the base Llama 3.2 license terms. |
| | |
| | ## Acknowledgments |
| | |
| | - Base model: Meta's Llama 3.2-3B-Instruct |
| | - Training framework: Hugging Face Transformers |
| | - Forensic tool integration: [FQLite](https://github.com/pawlaszczyk/fqlite) |
| | - Schema sources: iLEAPP, ALEAPP, Autopsy (used under their respective open-source licenses) |
| | |
| | ## Additional Resources |
| | |
| | - **Dataset (Zenodo):** [SQLiteDS — DOI to be added on publication] |
| | - **Dataset (HuggingFace):** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql) |
| | - **FQLite integration:** [github.com/pawlaszczyk/fqlite](https://github.com/pawlaszczyk/fqlite) |
| | - **Paper:** FSIDI-D-26-00029 (under review) |
| | |
| | --- |
| | |
| | **Disclaimer:** ForSQLiteLM is intended for research and forensic practitioner use. |
| | All generated SQL queries must be reviewed by a qualified practitioner before |
| | execution in live forensic investigations. The authors accept no liability for |
| | incorrect conclusions drawn from unvalidated model outputs. |
| | |