Update MODEL_CARD.md

fe9e9a6 verified 3 days ago

13.8 kB

	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- sql
	- forensics
	- text-to-sql
	- llama
	- fine-tuned
	base_model: unsloth/Llama-3.2-3B-Instruct
	datasets:
	- pawlaszc/mobile-forensics-sql
	metrics:
	- accuracy
	model-index:
	- name: ForensicSQL-Llama-3.2-3B
	results:
	- task:
	type: text-to-sql
	name: Text-to-SQL Generation
	dataset:
	type: mobile-forensics
	name: SQLiteDS — Mobile Forensics SQL Dataset (corrected)
	metrics:
	- type: accuracy
	value: 91.0
	name: Overall Accuracy (without app name)
	- type: accuracy
	value: 95.1
	name: Easy Queries Accuracy
	- type: accuracy
	value: 87.5
	name: Medium Queries Accuracy
	- type: accuracy
	value: 88.9
	name: Hard Queries Accuracy
	---

	# ForensicSQL-Llama-3.2-3B

	## Model Description

	ForSQLiteLM (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized
	for generating SQLite queries from natural language requests against mobile forensic databases.
	The model converts investigative questions into executable SQL queries across a wide range of
	forensic artifact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat,
	Instagram, blockchain wallets, and many more.

	This model was developed as part of a research project and accompanying journal paper
	investigating LLM fine-tuning for forensic database analysis, and is integrated into
	[FQLite](https://github.com/pawlaszczyk/fqlite), an established open-source forensic
	analysis tool.

	> Key result: 91.0% execution accuracy on a 100-example held-out test set — within
	> 4 percentage points of GPT-4o (95.0%) evaluated under identical conditions
	> (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally
	> with no internet connectivity required.

	## Model Details

	\| Property \| Value \|
	\|---\|---\|
	\| Base Model \| meta-llama/Llama-3.2-3B-Instruct \|
	\| Fine-tuning Method \| Full fine-tune (bf16) \|
	\| Training Dataset \| SQLiteDS — 800 training examples, 191 forensic artifact categories \|
	\| Training Framework \| Hugging Face Transformers \|
	\| Best Val Loss \| 0.3043 (7 epochs) \|
	\| Model Size (bf16) \| ~6 GB \|
	\| Hardware Required \| 16 GB unified memory (Apple M-series) or equivalent GPU \|

	## Performance

	### Overall Results (fixed dataset, n=100, best configuration)

	\| Metric \| Value \|
	\|---\|---\|
	\| Overall Accuracy \| 91.0% (91/100) \|
	\| 95% CI (Wilson) \| [83.8%, 95.2%] \|
	\| Executable Queries \| 92/100 \|
	\| GPT-4o Accuracy \| 95.0% (gap: 4 pp, p ≈ 0.39) \|
	\| Base Model (no fine-tuning) \| 35.0% \|
	\| Improvement over base \| +56 pp \|

	### Accuracy by Query Difficulty

	\| Difficulty \| Accuracy \| n \| 95% CI \| vs. GPT-4o \|
	\|---\|---\|---\|---\|---\|
	\| Easy (single-table) \| 95.1% \| 39/41 \| [83.9%, 98.7%] \| 0.0 pp \|
	\| Medium (joins, aggregation) \| 87.5% \| 28/32 \| [71.9%, 95.0%] \| 0.0 pp \|
	\| Hard (CTEs, window functions) \| 88.9% \| 24/27 \| [71.9%, 96.1%] \| −3.7 pp \|

	ForSQLiteLM matches GPT-4o exactly on Easy and Medium queries. The remaining gap
	is concentrated on Hard queries (complex CTEs, window functions, multi-table joins).

	### Accuracy by Forensic Domain

	\| Domain \| Accuracy \| n \| 95% CI \|
	\|---\|---\|---\|---\|
	\| Messaging & Social \| 100.0% \| 28/28 \| [87.9%, 100.0%] \|
	\| Android Artifacts \| 94.4% \| 17/18 \| [74.2%, 99.0%] \|
	\| Productivity & Other \| 88.9% \| 16/18 \| [67.2%, 96.9%] \|
	\| iOS CoreData \| 84.0% \| 21/25 \| [65.3%, 93.6%] \|
	\| Finance & Crypto \| 81.8% \| 9/11 \| [52.3%, 94.9%] \|

	### Prompt Configuration Ablation

	\| Configuration \| Overall \| Easy \| Medium \| Hard \| iOS \|
	\|---\|---\|---\|---\|---\|---\|
	\| WITHOUT App Name ★ \| 91.0% \| 95.1% \| 87.5% \| 88.9% \| 84.0% \|
	\| WITH App Name \| 88.0% \| 92.7% \| 87.5% \| 81.5% \| 88.0% \|

	★ Primary configuration — omitting the application name from the prompt yields
	3 pp higher overall accuracy. Interestingly, including the app name helps iOS
	CoreData schemas (+4 pp) but hurts Hard queries (−7.4 pp); the primary
	configuration without app name is recommended for general use.

	### Post-Processing Pipeline Contribution

	\| Component \| Queries saved \|
	\|---\|---\|
	\| Execution feedback (retry) \| 7 \|
	\| Alias normalization \| 18 \|
	\| Column corrections (Levenshtein) \| 2 \|

	### Training Progression

	\| Configuration \| Val Loss \| Accuracy \| Δ \|
	\|---\|---\|---\|---\|
	\| Base model (no fine-tuning) \| — \| 35.0% \| — \|
	\| Fine-tuned, no augmentation \| — \| 68.0% \| +33 pp \|
	\| + Data augmentation (3.4×) \| — \| 74.0% \| +6 pp \|
	\| + Extended training (7 epochs) \| 0.3617 \| 84.0% \| +10 pp \|
	\| + Post-processing pipeline \| 0.3617 \| 87.0% \| +3 pp \|
	\| + Execution feedback \| 0.3617 \| 90.0% \| +3 pp \|
	\| + Corrected training dataset (v5) \| 0.3043 \| 91.0% \| +1 pp \|

	## Intended Use

	### Primary Use Cases
	- Mobile forensics investigations: automated SQL query drafting against seized device databases
	- Integration into forensic tools (FQLite, Autopsy, ALEAPP/iLEAPP workflows)
	- Research in domain-specific Text-to-SQL
	- Educational use for learning forensic database analysis

	### Important: This Model is a Drafting Assistant

	> ForSQLiteLM is not a replacement for SQL expertise. It generates candidate queries
	> that require review by a practitioner with sufficient SQL knowledge before any reliance
	> is placed on their results. The 91.0% accuracy means approximately **1 in 11 queries
	> contains an error**. In court-admissible or case-critical work, all outputs must be
	> independently validated.

	### Out-of-Scope Use
	- Autonomous forensic decision-making without human review
	- Production systems requiring >95% guaranteed accuracy
	- General-purpose SQL generation outside the forensic domain
	- Non-SQLite databases (PostgreSQL, MySQL, etc.)

	## How to Use

	### Quick Start (Transformers)

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(
	model_name,
	torch_dtype=torch.bfloat16,
	device_map="auto"
	)
	model.eval()

	schema = """
	CREATE TABLE message (
	ROWID INTEGER PRIMARY KEY,
	text TEXT,
	handle_id INTEGER,
	date INTEGER,
	is_from_me INTEGER,
	cache_has_attachments INTEGER
	);
	CREATE TABLE handle (
	ROWID INTEGER PRIMARY KEY,
	id TEXT,
	service TEXT
	);
	"""

	request = "Find all messages received in the last 7 days that contain attachments"

	# Note: do NOT use apply_chat_template — use plain-text prompt
	prompt = f"""Generate a valid SQLite query for this forensic database request.

	Database Schema:
	{schema}

	Request: {request}

	SQLite Query:
	"""

	inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
	inputs = {k: v.to(model.device) for k, v in inputs.items()}

	with torch.no_grad():
	outputs = model.generate(
	**inputs,
	max_new_tokens=300,
	do_sample=False, # greedy decoding — do not change
	)

	input_length = inputs['input_ids'].shape[1]
	sql = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
	print(sql.strip())
	```

	> Important: Use plain-text tokenization (do not call `apply_chat_template`).
	> The model was trained and evaluated with a plain-text prompt format.
	> Use `do_sample=False` (greedy decoding) for reproducible results.

	### Python Helper Class

	```python
	class ForensicSQLGenerator:
	def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"):
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	self.tokenizer = AutoTokenizer.from_pretrained(model_name)
	self.model = AutoModelForCausalLM.from_pretrained(
	model_name,
	torch_dtype=torch.bfloat16,
	device_map="auto"
	)
	self.model.eval()

	def generate_sql(self, schema: str, request: str) -> str:
	prompt = (
	"Generate a valid SQLite query for this forensic database request.\n\n"
	f"Database Schema:\n{schema}\n\n"
	f"Request: {request}\n\n"
	"SQLite Query:\n"
	)
	inputs = self.tokenizer(
	prompt, return_tensors="pt", truncation=True, max_length=2048
	)
	inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
	input_length = inputs["input_ids"].shape[1]

	with torch.no_grad():
	outputs = self.model.generate(
	**inputs, max_new_tokens=300, do_sample=False
	)

	sql = self.tokenizer.decode(
	outputs[0][input_length:], skip_special_tokens=True
	)
	# Return first statement only, normalized
	return sql.strip().split("\n")[0].strip().rstrip(";") + ";"


	# Usage
	generator = ForensicSQLGenerator()
	sql = generator.generate_sql(schema, "Find all unread messages from the last 24 hours")
	print(sql)
	```

	### With Ollama / llama.cpp (GGUF)

	```bash
	# With llama.cpp
	./llama-cli -m forensic-sql-q4_k_m.gguf \
	--temp 0 \
	-p "Generate a valid SQLite query for this forensic database request.

	Database Schema:
	CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER);

	Request: Find all messages sent after midnight

	SQLite Query:"

	# With Ollama — create a Modelfile
	cat > Modelfile << 'EOF'
	FROM ./forensic-sql-q4_k_m.gguf
	PARAMETER temperature 0
	PARAMETER num_predict 300
	EOF

	ollama create forensic-sql -f Modelfile
	ollama run forensic-sql
	```

	## Training Details

	### Dataset — SQLiteDS

	- Total examples: 1,000 (800 train / 100 val / 100 test), fixed random seed 42
	- Forensic artifact categories: 191
	- Reference query validation: All 1,000 reference queries validated for execution
	correctness against in-memory SQLite; 50 queries (5%) corrected before final training
	- Augmentation: 3.4× expansion via instruction paraphrasing, WHERE clause reordering,
	and LIMIT injection — augmented examples confined to training split only
	- Dataset: [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
	- License: CC BY 4.0

	### Hyperparameters

	\| Parameter \| Value \|
	\|---\|---\|
	\| Training method \| Full fine-tune (no LoRA) \|
	\| Precision \| bfloat16 \|
	\| Epochs \| 7 \|
	\| Learning rate \| 2e-5 (peak) \|
	\| LR scheduler \| Cosine with warmup \|
	\| Batch size \| 1 + gradient accumulation 4 \|
	\| Max sequence length \| 2048 \|
	\| Optimizer \| AdamW \|
	\| Hardware \| Apple M-series, 16 GB unified memory \|
	\| Training time \| ~17.6 hours \|
	\| Best val loss \| 0.3043 (epoch 7) \|

	### Key Training Insight: Sequence Length

	Early training runs with `max_seq_length=512` truncated 92% of examples, causing
	the model to learn schema generation (CREATE TABLE) instead of queries — resulting
	in only ~50% accuracy. Setting `max_seq_length=2048` eliminated truncation and
	improved accuracy from 50% to 68% before augmentation, and to 91% after all
	training components were applied.

	## Limitations

	### Known Issues

	1. iOS CoreData Schemas (84.0%): The Z-prefix column naming convention
	(e.g., `ZISFROMME`, `ZTIMESTAMP`) provides no semantic signal from column
	names alone, making these schemas harder to reason about.
	2. Hard Queries — 3.7 pp gap to GPT-4o: Complex CTEs, recursive queries,
	and window functions are the primary remaining challenge.
	3. Finance & Crypto (81.8%, n=11): Small test set; confidence intervals are
	wide. Interpret with caution.
	4. ~1 in 11 error rate: Approximately 9% of generated queries will contain
	errors. Expert review of all outputs is required before use in investigations.

	### When Human Review is Especially Important
	- Complex multi-table queries with CTEs or window functions
	- Case-critical or court-admissible investigations
	- Any query that will be used to draw conclusions about a suspect
	- Queries involving rare or unusual forensic artifact schemas

	## Evaluation

	- Test set: 100 examples, held-out, seed=42, non-augmented
	- Metric: Execution accuracy — query is correct iff it executes without error
	AND returns a result set identical to the reference query
	- Reference validation: All reference queries validated for execution correctness
	before evaluation; 5 broken queries in the test set were corrected
	- Evaluation script: Available in the dataset repository on Zenodo ([DOI])

	## Citation

	If you use this model or the SQLiteDS dataset in your research, please cite:

	```bibtex
	@article{pawlaszczyk2026forsqlitelm,
	author = {Dirk Pawlaszczyk},
	title = {AI-Based Automated SQL Query Generation for SQLite Databases
	in Mobile Forensics},
	journal = {Forensic Science International: Digital Investigation},
	year = {2026},
	note = {FSIDI-D-26-00029}
	}
	```

	## License

	Apache 2.0 — following the base Llama 3.2 license terms.

	## Acknowledgments

	- Base model: Meta's Llama 3.2-3B-Instruct
	- Training framework: Hugging Face Transformers
	- Forensic tool integration: [FQLite](https://github.com/pawlaszczyk/fqlite)
	- Schema sources: iLEAPP, ALEAPP, Autopsy (used under their respective open-source licenses)

	## Additional Resources

	- Dataset (Zenodo): [SQLiteDS — DOI to be added on publication]
	- Dataset (HuggingFace): [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
	- FQLite integration: [github.com/pawlaszczyk/fqlite](https://github.com/pawlaszczyk/fqlite)
	- Paper: FSIDI-D-26-00029 (under review)

	---

	Disclaimer: ForSQLiteLM is intended for research and forensic practitioner use.
	All generated SQL queries must be reviewed by a qualified practitioner before
	execution in live forensic investigations. The authors accept no liability for
	incorrect conclusions drawn from unvalidated model outputs.