pawlaszc committed
Commit f83cd2d · verified · 1 Parent(s): 9f7d61b

Upload MODEL_CARD.md

Files changed (1):
  1. MODEL_CARD.md +249 −181
MODEL_CARD.md CHANGED
@@ -11,7 +11,7 @@ tags:
   - fine-tuned
 base_model: unsloth/Llama-3.2-3B-Instruct
 datasets:
-- [your-username]/mobile-forensics-sql
 metrics:
 - accuracy
 model-index:
@@ -22,19 +22,19 @@ model-index:
 name: Text-to-SQL Generation
 dataset:
 type: mobile-forensics
-name: Mobile Forensics SQL Dataset
 metrics:
 - type: accuracy
-value: 79.0
-name: Overall Accuracy
 - type: accuracy
-value: 94.3
 name: Easy Queries Accuracy
 - type: accuracy
-value: 80.6
 name: Medium Queries Accuracy
 - type: accuracy
-value: 61.8
 name: Hard Queries Accuracy
 ---
 
@@ -42,55 +42,121 @@ model-index:

 ## Model Description

- **ForensicSQL** is a fine-tuned Llama 3.2 3B model specialized for generating SQLite queries for mobile forensics databases. The model converts natural language forensic investigation requests into executable SQL queries across various mobile app databases (WhatsApp, Signal, iOS Health, Android SMS, etc.).

- This model was developed as part of a master's thesis investigating LLM fine-tuning for forensic database analysis.

 ## Model Details

- - **Base Model:** Llama 3.2 3B Instruct
- - **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- - **Training Dataset:** 768 forensic SQL examples across 148 categories
- - **Training Framework:** Hugging Face Transformers + PEFT
- - **Model Size:**
-   - Full (FP16): ~6 GB
-   - GGUF Q4_K_M: ~2.3 GB
-   - GGUF Q5_K_M: ~2.8 GB
-   - GGUF Q8_0: ~3.8 GB

 ## Performance

- ### Overall Results
- - **Overall Accuracy:** 79.0%
- - **Schema Generation Errors:** 0% (completely eliminated)
- - **Executable Queries:** 79%
-
- ### Breakdown by Difficulty
- | Difficulty             | Accuracy | Examples |
- |------------------------|----------|----------|
- | Easy (single-table)    | 94.3%    | 33/35    |
- | Medium (simple joins)  | 80.6%    | 25/31    |
- | Hard (complex queries) | 61.8%    | 21/34    |
-
- ### Error Analysis
- | Error Type           | Percentage | Description                        |
- |----------------------|------------|------------------------------------|
- | Column Hallucination | 18%        | References non-existent columns    |
- | Syntax Errors        | 3%         | Invalid SQL syntax                 |
- | Schema Generation    | 0%         | Eliminated through proper training |

 ## Intended Use

 ### Primary Use Cases
- - Mobile forensics investigations
- - Automated SQL query generation for forensic databases
- - Educational tool for learning forensic database analysis
- - Research in text-to-SQL for specialized domains

 ### Out-of-Scope Use
- - General-purpose SQL generation (use specialized models)
- - Production systems requiring >95% accuracy
- - Real-time critical forensic decisions without human review

 ## How to Use
 
@@ -100,28 +166,34 @@ This model was developed as part of a master's thesis investigating LLM fine-tuning
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch

- # Load model and tokenizer
 model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
-     torch_dtype=torch.float16,
     device_map="auto"
 )

- # Prepare input
 schema = """
- CREATE TABLE messages (
-     _id INTEGER PRIMARY KEY,
-     address TEXT,
-     body TEXT,
     date INTEGER,
-     read INTEGER
 );
 """

- request = "Find all unread messages from yesterday"

 prompt = f"""Generate a valid SQLite query for this forensic database request.

 Database Schema:
@@ -132,204 +204,200 @@ Request: {request}
 SQLite Query:
 """

- # Generate SQL
 inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
 inputs = {k: v.to(model.device) for k, v in inputs.items()}

 with torch.no_grad():
     outputs = model.generate(
         **inputs,
-         max_new_tokens=200,
-         do_sample=False,
     )

- # Decode only the generated part
 input_length = inputs['input_ids'].shape[1]
- generated_tokens = outputs[0][input_length:]
- sql = tokenizer.decode(generated_tokens, skip_special_tokens=True)
-
 print(sql.strip())
- # Output: SELECT * FROM messages WHERE read = 0 AND date > ...
- ```
-
- ### Using GGUF Files (llama.cpp / Ollama)
-
- **With llama.cpp:**
- ```bash
- # Download GGUF file
- wget https://huggingface.co/pawlaszc/ForensicSQL-Llama-3.2-3B/resolve/main/forensic-sql-q4_k_m.gguf
-
- # Run inference
- ./llama-cli -m forensic-sql-q4_k_m.gguf -p "Generate SQL..."
 ```

- **With Ollama:**
- ```bash
- # Create Modelfile
- FROM ./forensic-sql-q4_k_m.gguf
- PARAMETER temperature 0
- PARAMETER top_p 0.9
-
- # Import
- ollama create forensic-sql -f Modelfile
-
- # Use
- ollama run forensic-sql "Schema: ...\nRequest: Find messages\nSQL:"
- ```
 

 ### Python Helper Class

 ```python
 class ForensicSQLGenerator:
-     def __init__(self, model_name="pawalaszc/ForensicSQL-Llama-3.2-3B"):
         from transformers import AutoModelForCausalLM, AutoTokenizer
         import torch
-
         self.tokenizer = AutoTokenizer.from_pretrained(model_name)
         self.model = AutoModelForCausalLM.from_pretrained(
             model_name,
-             torch_dtype=torch.float16,
             device_map="auto"
         )
         self.model.eval()
-
-     def generate_sql(self, schema: str, request: str) -> str:
-         prompt = f"""Generate a valid SQLite query for this forensic database request.
-
- Database Schema:
- {schema}
-
- Request: {request}
-
- SQLite Query:
- """
         inputs = self.tokenizer(
-             prompt,
-             return_tensors="pt",
-             truncation=True,
-             max_length=2048
         )
         inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
-
-         input_length = inputs['input_ids'].shape[1]
-
         with torch.no_grad():
             outputs = self.model.generate(
-                 **inputs,
-                 max_new_tokens=200,
-                 do_sample=False,
             )
-
-         generated_tokens = outputs[0][input_length:]
-         sql = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
-
         return sql.strip().split("\n")[0].strip().rstrip(";") + ";"

 # Usage
 generator = ForensicSQLGenerator()
- sql = generator.generate_sql(schema, request)
 ```

- ## Training Details

- ### Training Data
- - **Size:** 768 examples (original) → 2,304 examples (with augmentation)
- - **Categories:** 148 forensic database categories
- - **Sources:** WhatsApp, Signal, iMessage, SMS, iOS apps, Android apps
- - **Augmentation:** 3x paraphrase augmentation per example

- ### Training Procedure
- - **Method:** LoRA fine-tuning
- - **LoRA Rank:** 16
- - **LoRA Alpha:** 32
- - **Target Modules:** q_proj, k_proj, v_proj, o_proj
- - **Epochs:** 5
- - **Learning Rate:** 2e-5
- - **Batch Size:** 1 (gradient accumulation: 4)
- - **Max Sequence Length:** 2048 (critical for preventing truncation)
- - **Optimizer:** AdamW
- - **Scheduler:** Cosine with warmup (10%)
- - **Hardware:** Apple M2 (MPS)
- - **Training Time:** ~3.5 hours

- ### Key Training Insights

- **Critical Discovery: Sequence Length Matters**

- Initial training attempts with `max_seq_length=512` resulted in only 50% accuracy because 92% of training examples were truncated. The model learned to generate schema definitions (CREATE TABLE) instead of queries.

- Increasing to `max_seq_length=2048` eliminated truncation and improved accuracy from 50% to 79% (+29pp).

- **Lesson:** Data preprocessing and proper sequence length configuration are critical for fine-tuning success.

 ## Limitations

 ### Known Issues
- 1. **Column Hallucination (18%):** Model sometimes references non-existent columns
- 2. **Complex Joins:** Performance drops on multi-table queries requiring JOINs (62%)
- 3. **Schema Understanding:** Limited understanding of foreign key relationships

- ### When to Use Human Review
- - Complex multi-table queries
- - Critical forensic investigations
- - Queries involving data deletion or modification
- - When accuracy >95% is required

 ## Evaluation

- ### Test Set
- - **Size:** 100 queries (random sample from held-out data)
- - **Seed:** 42 (reproducible)
- - **Evaluation Metric:** Exact match (query results must match expected results)
-
- ### Ablation Studies
-
- | Configuration                  | Accuracy | Notes          |
- |--------------------------------|----------|----------------|
- | Zero-shot baseline             | 45%      | No fine-tuning |
- | Final training (max_len=2048)  | 79%      | No truncation  |

 ## Citation

- If you use this model in your research, please cite:

 ```bibtex
- @mastersthesis{forensicsql2025,
-   author = {Dirk Pawlaszczyk},
-   title  = {Fine-Tuning Large Language Models for Forensic SQL Query Generation},
-   school = {[Hochschule Mittweida University of Applied Sciences]},
-   year   = {2026},
-   type   = {Journal}
 }
 ```

- ## Model Card Authors
-
- Dirk Pawlaszczyk
-
- ## Model Card Contact
-
- For questions or issues, please open an issue on the (https://github.com/pawlaszczyk/forensic-sql) or contact pawlaszc@hs-mittweida.de.
-
 ## License

- This model is released under the Apache 2.0 License, following the base Llama 3.2 license.

 ## Acknowledgments

- - Base model: Meta's Llama 3.2 3B Instruct
- - Training framework: Hugging Face Transformers, PEFT
- - Dataset creation: Custom forensic database schemas
- - Inspiration: Text-to-SQL research community

 ## Additional Resources

- - **Dataset:** pawlaszc/mobile-forensics-sql
- - **GitHub:** https://github.com/pawlaszc/forensic-sql
- - **Paper:** [Link when published]
- - **Demo:** [HuggingFace Space if you create one]

 ---

- **Disclaimer:** This model is intended for research and educational purposes. Always validate generated SQL queries before execution in production forensic investigations. The model may produce incorrect queries that could lead to data loss or incorrect conclusions if used without proper review.
 
 
 
 
   - fine-tuned
 base_model: unsloth/Llama-3.2-3B-Instruct
 datasets:
+ - pawlaszc/mobile-forensics-sql
 metrics:
 - accuracy
 model-index:
 name: Text-to-SQL Generation
 dataset:
 type: mobile-forensics
+ name: SQLiteDS — Mobile Forensics SQL Dataset (corrected)
 metrics:
 - type: accuracy
+ value: 91.0
+ name: Overall Accuracy (without app name)
 - type: accuracy
+ value: 95.1
 name: Easy Queries Accuracy
 - type: accuracy
+ value: 87.5
 name: Medium Queries Accuracy
 - type: accuracy
+ value: 88.9
 name: Hard Queries Accuracy
 ---
 
 

 ## Model Description

+ **ForSQLiteLM** (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized
+ for generating SQLite queries from natural language requests against mobile forensic databases.
+ The model converts investigative questions into executable SQL queries across a wide range of
+ forensic artifact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat,
+ Instagram, blockchain wallets, and many more.

+ This model was developed as part of a master's thesis and accompanying journal paper
+ investigating LLM fine-tuning for forensic database analysis, and is integrated into
+ [FQLite](https://github.com/pawlaszczyk/fqlite), an established open-source forensic
+ analysis tool.
+
+ > **Key result:** 91.0% execution accuracy on a 100-example held-out test set — within
+ > 4 percentage points of GPT-4o (95.0%) evaluated under identical conditions
+ > (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally
+ > with no internet connectivity required.
 

 ## Model Details

+ | Property | Value |
+ |---|---|
+ | **Base Model** | meta-llama/Llama-3.2-3B-Instruct |
+ | **Fine-tuning Method** | Full fine-tune (bf16) |
+ | **Training Dataset** | SQLiteDS — 800 training examples, 191 forensic artifact categories |
+ | **Training Framework** | Hugging Face Transformers |
+ | **Best Val Loss** | 0.3043 (7 epochs) |
+ | **Model Size (bf16)** | ~6 GB |
+ | **Hardware Required** | 16 GB unified memory (Apple M-series) or equivalent GPU |

 ## Performance

+ ### Overall Results (fixed dataset, n=100, best configuration)
+
+ | Metric | Value |
+ |---|---|
+ | **Overall Accuracy** | **91.0%** (91/100) |
+ | 95% CI (Wilson) | [83.8%, 95.2%] |
+ | Executable Queries | 92/100 |
+ | GPT-4o Accuracy | 95.0% (gap: 4 pp, p ≈ 0.39) |
+ | Base Model (no fine-tuning) | 35.0% |
+ | Improvement over base | +56 pp |
+
+
86
+ ### Accuracy by Query Difficulty
87
+
88
+ | Difficulty | Accuracy | n | 95% CI | vs. GPT-4o |
89
+ |---|---|---|---|---|
90
+ | Easy (single-table) | **95.1%** | 39/41 | [83.9%, 98.7%] | 0.0 pp |
91
+ | Medium (joins, aggregation) | **87.5%** | 28/32 | [71.9%, 95.0%] | 0.0 pp |
92
+ | Hard (CTEs, window functions) | **88.9%** | 24/27 | [71.9%, 96.1%] | −3.7 pp |
93
+
94
+ ForSQLiteLM matches GPT-4o exactly on Easy and Medium queries. The remaining gap
95
+ is concentrated on Hard queries (complex CTEs, window functions, multi-table joins).
96
+
97
+ ### Accuracy by Forensic Domain
98
+
99
+ | Domain | Accuracy | n | 95% CI |
100
+ |---|---|---|---|
101
+ | Messaging & Social | **100.0%** | 28/28 | [87.9%, 100.0%] |
102
+ | Android Artifacts | **94.4%** | 17/18 | [74.2%, 99.0%] |
103
+ | Productivity & Other | **88.9%** | 16/18 | [67.2%, 96.9%] |
104
+ | iOS CoreData | **84.0%** | 21/25 | [65.3%, 93.6%] |
105
+ | Finance & Crypto | **81.8%** | 9/11 | [52.3%, 94.9%] |
106
+
107
+ ### Prompt Configuration Ablation
108
+
109
+ | Configuration | Overall | Easy | Medium | Hard | iOS |
110
+ |---|---|---|---|---|---|
111
+ | **WITHOUT App Name** ★ | **91.0%** | **95.1%** | 87.5% | **88.9%** | 84.0% |
112
+ | WITH App Name | 88.0% | 92.7% | 87.5% | 81.5% | **88.0%** |
113
+
114
+ ★ Primary configuration — omitting the application name from the prompt yields
115
+ 3 pp higher overall accuracy. Interestingly, including the app name helps iOS
116
+ CoreData schemas (+4 pp) but hurts Hard queries (−7.4 pp); the primary
117
+ configuration without app name is recommended for general use.
118
+
119
+ ### Post-Processing Pipeline Contribution
120
+
121
+ | Component | Queries saved |
122
+ |---|---|
123
+ | Execution feedback (retry) | 7 |
124
+ | Alias normalization | 18 |
125
+ | Column corrections (Levenshtein) | 2 |
126
+
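The column-correction step above can be pictured as fuzzy matching of generated identifiers against the real schema columns; a simplified sketch (`difflib` stands in for a Levenshtein matcher, and the function and keyword set are illustrative, not the card's actual pipeline):

```python
import difflib
import re

SQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR", "NOT", "ORDER", "BY", "LIMIT"}

def correct_columns(sql: str, schema_columns: list[str], cutoff: float = 0.7) -> str:
    """Replace identifiers that are near-misses of real schema column names."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        # Leave keywords and exact column names untouched
        if word.upper() in SQL_KEYWORDS or word in schema_columns:
            return word
        close = difflib.get_close_matches(word, schema_columns, n=1, cutoff=cutoff)
        return close[0] if close else word

    return re.sub(r"[A-Za-z_]\w*", fix, sql)

print(correct_columns("SELECT adress, body FROM sms WHERE readed = 0",
                      ["_id", "address", "body", "read", "date"]))
# → SELECT address, body FROM sms WHERE read = 0
```

The real pipeline would additionally need to skip table names and string literals; this sketch only shows the core near-miss repair idea.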
+ ### Training Progression
+
+ | Configuration | Val Loss | Accuracy | Δ |
+ |---|---|---|---|
+ | Base model (no fine-tuning) | — | 35.0% | — |
+ | Fine-tuned, no augmentation | — | 68.0% | +33 pp |
+ | + Data augmentation (3.4×) | — | 74.0% | +6 pp |
+ | + Extended training (7 epochs) | 0.3617 | 84.0% | +10 pp |
+ | + Post-processing pipeline | 0.3617 | 87.0% | +3 pp |
+ | + Execution feedback | 0.3617 | 90.0% | +3 pp |
+ | + Corrected training dataset (v5) | **0.3043** | **91.0%** | +1 pp |
138
 
139
  ## Intended Use
140
 
141
  ### Primary Use Cases
142
+ - Mobile forensics investigations: automated SQL query drafting against seized device databases
143
+ - Integration into forensic tools (FQLite, Autopsy, ALEAPP/iLEAPP workflows)
144
+ - Research in domain-specific Text-to-SQL
145
+ - Educational use for learning forensic database analysis
146
+
147
+ ### Important: This Model is a Drafting Assistant
148
+
149
+ > **ForSQLiteLM is not a replacement for SQL expertise.** It generates candidate queries
150
+ > that require review by a practitioner with sufficient SQL knowledge before any reliance
151
+ > is placed on their results. The 91.0% accuracy means approximately **1 in 11 queries
152
+ > contains an error**. In court-admissible or case-critical work, all outputs must be
153
+ > independently validated.
154
 
155
  ### Out-of-Scope Use
156
+ - Autonomous forensic decision-making without human review
157
+ - Production systems requiring >95% guaranteed accuracy
158
+ - General-purpose SQL generation outside the forensic domain
159
+ - Non-SQLite databases (PostgreSQL, MySQL, etc.)
160
 
161
  ## How to Use
162
 
 
 from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch

 model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(
     model_name,
+     torch_dtype=torch.bfloat16,
     device_map="auto"
 )
+ model.eval()

 schema = """
+ CREATE TABLE message (
+     ROWID INTEGER PRIMARY KEY,
+     text TEXT,
+     handle_id INTEGER,
     date INTEGER,
+     is_from_me INTEGER,
+     cache_has_attachments INTEGER
+ );
+ CREATE TABLE handle (
+     ROWID INTEGER PRIMARY KEY,
+     id TEXT,
+     service TEXT
 );
 """

+ request = "Find all messages received in the last 7 days that contain attachments"

+ # Note: do NOT use apply_chat_template — use the plain-text prompt format
 prompt = f"""Generate a valid SQLite query for this forensic database request.

 Database Schema:

 SQLite Query:
 """

 inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
 inputs = {k: v.to(model.device) for k, v in inputs.items()}

 with torch.no_grad():
     outputs = model.generate(
         **inputs,
+         max_new_tokens=300,
+         do_sample=False,  # greedy decoding — do not change
     )

 input_length = inputs['input_ids'].shape[1]
+ sql = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
 print(sql.strip())
 ```

+ > **Important:** Use plain-text tokenization (do **not** call `apply_chat_template`).
+ > The model was trained and evaluated with a plain-text prompt format.
+ > Use `do_sample=False` (greedy decoding) for reproducible results.
 
 
 
 
 
 
 
 
 
 

 ### Python Helper Class

 ```python
 class ForensicSQLGenerator:
+     def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"):
         from transformers import AutoModelForCausalLM, AutoTokenizer
         import torch
+
         self.tokenizer = AutoTokenizer.from_pretrained(model_name)
         self.model = AutoModelForCausalLM.from_pretrained(
             model_name,
+             torch_dtype=torch.bfloat16,
             device_map="auto"
         )
         self.model.eval()

+     def generate_sql(self, schema: str, request: str) -> str:
+         import torch  # needed here too: the __init__ import is function-local
+         prompt = (
+             "Generate a valid SQLite query for this forensic database request.\n\n"
+             f"Database Schema:\n{schema}\n\n"
+             f"Request: {request}\n\n"
+             "SQLite Query:\n"
+         )
         inputs = self.tokenizer(
+             prompt, return_tensors="pt", truncation=True, max_length=2048
         )
         inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
+         input_length = inputs["input_ids"].shape[1]
+
         with torch.no_grad():
             outputs = self.model.generate(
+                 **inputs, max_new_tokens=300, do_sample=False
             )
+
+         sql = self.tokenizer.decode(
+             outputs[0][input_length:], skip_special_tokens=True
+         )
+         # Return first statement only, normalized
         return sql.strip().split("\n")[0].strip().rstrip(";") + ";"

+
 # Usage
 generator = ForensicSQLGenerator()
+ sql = generator.generate_sql(schema, "Find all unread messages from the last 24 hours")
+ print(sql)
 ```
 
273
+ ### With Ollama / llama.cpp (GGUF)
274
 
275
+ ```bash
276
+ # With llama.cpp
277
+ ./llama-cli -m forensic-sql-q4_k_m.gguf \
278
+ --temp 0 \
279
+ -p "Generate a valid SQLite query for this forensic database request.
280
+
281
+ Database Schema:
282
+ CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER);
283
 
284
+ Request: Find all messages sent after midnight
 
 
 
 
 
 
 
 
 
 
 
 
285
 
286
+ SQLite Query:"
287
 
288
+ # With Ollama create a Modelfile
289
+ cat > Modelfile << 'EOF'
290
+ FROM ./forensic-sql-q4_k_m.gguf
291
+ PARAMETER temperature 0
292
+ PARAMETER num_predict 300
293
+ EOF
294
 
295
+ ollama create forensic-sql -f Modelfile
296
+ ollama run forensic-sql
297
+ ```
298
 
+ ## Training Details

+ ### Dataset: SQLiteDS
+
+ - **Total examples:** 1,000 (800 train / 100 val / 100 test), fixed random seed 42
+ - **Forensic artifact categories:** 191
+ - **Reference query validation:** All 1,000 reference queries validated for execution
+   correctness against in-memory SQLite; 50 queries (5%) corrected before final training
+ - **Augmentation:** 3.4× expansion via instruction paraphrasing, WHERE clause reordering,
+   and LIMIT injection — augmented examples confined to the training split only
+ - **Dataset:** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
+ - **License:** CC BY 4.0
+
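The WHERE-clause reordering mentioned above can be illustrated with a small sketch (a simplified stand-in for the actual augmentation code; it swaps the first two AND-joined predicates, which preserves SELECT semantics for side-effect-free conditions):

```python
import re

def reorder_where(sql: str) -> str:
    """Swap the first two AND-joined predicates in the WHERE clause."""
    m = re.search(
        r"(WHERE\s+)(.+?)(\s+AND\s+)(.+?)(\s*(?:ORDER BY|GROUP BY|LIMIT|;|$))",
        sql, flags=re.IGNORECASE | re.DOTALL,
    )
    if not m:
        return sql  # no reorderable WHERE clause: return unchanged
    return (sql[:m.start()] + m.group(1) + m.group(4) + m.group(3)
            + m.group(2) + m.group(5) + sql[m.end():])

print(reorder_where("SELECT * FROM sms WHERE read = 0 AND date > 5;"))
# → SELECT * FROM sms WHERE date > 5 AND read = 0;
```

A real implementation would parse the SQL rather than use a regex, but the sketch shows how a semantically equivalent variant is produced from each training example.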
+ ### Hyperparameters
+
+ | Parameter | Value |
+ |---|---|
+ | Training method | Full fine-tune (no LoRA) |
+ | Precision | bfloat16 |
+ | Epochs | 7 |
+ | Learning rate | 2e-5 (peak) |
+ | LR scheduler | Cosine with warmup |
+ | Batch size | 1 + gradient accumulation 4 |
+ | Max sequence length | 2048 |
+ | Optimizer | AdamW |
+ | Hardware | Apple M-series, 16 GB unified memory |
+ | Training time | ~17.6 hours |
+ | Best val loss | 0.3043 (epoch 7) |
+
+ ### Key Training Insight: Sequence Length
+
+ Early training runs with `max_seq_length=512` truncated 92% of examples, causing
+ the model to learn schema generation (CREATE TABLE) instead of queries — resulting
+ in only ~50% accuracy. Setting `max_seq_length=2048` eliminated truncation and
+ improved accuracy from 50% to 68% before augmentation, and to 91% after all
+ training components were applied.
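A quick pre-training check like the following catches this failure mode early (a generic sketch; `tokenize` would be a real Hugging Face tokenizer in practice, and `str.split` below is only a stand-in for the test call):

```python
def truncation_rate(texts, tokenize, max_seq_length=2048):
    """Fraction of training examples whose token count exceeds max_seq_length."""
    over = sum(1 for t in texts if len(tokenize(t)) > max_seq_length)
    return over / len(texts)

# With a real tokenizer: tokenize = lambda t: tokenizer(t)["input_ids"]
rate = truncation_rate(["short example", "x " * 3000], str.split, max_seq_length=2048)
print(f"{rate:.0%} of examples would be truncated")
```

If the reported rate is non-trivial, raise `max_seq_length` (or shorten the schemas in the prompt) before training.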
 

 ## Limitations

 ### Known Issues

+ 1. **iOS CoreData schemas (84.0%):** The Z-prefix column naming convention
+    (e.g., `ZISFROMME`, `ZTIMESTAMP`) provides no semantic signal from column
+    names alone, making these schemas harder to reason about.
+ 2. **Hard queries (3.7 pp gap to GPT-4o):** Complex CTEs, recursive queries,
+    and window functions are the primary remaining challenge.
+ 3. **Finance & Crypto (81.8%, n=11):** Small test set; confidence intervals are
+    wide. Interpret with caution.
+ 4. **~1-in-11 error rate:** Approximately 9% of generated queries contain
+    errors. Expert review of all outputs is required before use in investigations.
+
+ ### When Human Review Is Especially Important
+ - Complex multi-table queries with CTEs or window functions
+ - Case-critical or court-admissible investigations
+ - Any query that will be used to draw conclusions about a suspect
+ - Queries involving rare or unusual forensic artifact schemas
 

 ## Evaluation

+ - **Test set:** 100 examples, held-out, seed=42, non-augmented
+ - **Metric:** Execution accuracy: a query counts as correct iff it executes without
+   error AND returns a result set identical to that of the reference query
+ - **Reference validation:** All reference queries validated for execution correctness
+   before evaluation; 5 broken queries in the test set were corrected
+ - **Evaluation script:** Available in the dataset repository on Zenodo ([DOI])

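The execution-accuracy metric can be sketched as running both queries against an in-memory SQLite database and comparing result sets (a minimal illustration of the metric as described, not the actual evaluation script; whether row order matters is a design choice, and this sketch ignores it):

```python
import sqlite3

def execution_match(candidate: str, reference: str, setup_sql: str) -> bool:
    """True iff candidate executes and returns the same rows as reference."""
    con = sqlite3.connect(":memory:")
    con.executescript(setup_sql)
    try:
        got = con.execute(candidate).fetchall()
    except sqlite3.Error:
        return False  # non-executable query counts as wrong
    expected = con.execute(reference).fetchall()
    # Compare as order-insensitive multisets of rows
    return sorted(map(repr, got)) == sorted(map(repr, expected))

setup = """
CREATE TABLE sms (_id INTEGER PRIMARY KEY, body TEXT, read INTEGER);
INSERT INTO sms VALUES (1, 'hi', 0), (2, 'ok', 1);
"""
print(execution_match("SELECT body FROM sms WHERE read = 0",
                      "SELECT body FROM sms WHERE read != 1", setup))  # → True
```

Note that equivalent-but-differently-shaped queries (different column order or aliases) would need normalization before comparison, which is what the alias-normalization step in the post-processing pipeline addresses.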
 
 ## Citation

+ If you use this model or the SQLiteDS dataset in your research, please cite:

 ```bibtex
+ @article{pawlaszczyk2026forsqlitelm,
+   author  = {Dirk Pawlaszczyk},
+   title   = {AI-Based Automated SQL Query Generation for SQLite Databases
+              in Mobile Forensics},
+   journal = {Forensic Science International: Digital Investigation},
+   year    = {2026},
+   note    = {FSIDI-D-26-00029}
 }
 ```

 ## License

+ Apache 2.0, following the base Llama 3.2 license terms.

 ## Acknowledgments

+ - Base model: Meta's Llama 3.2-3B-Instruct
+ - Training framework: Hugging Face Transformers
+ - Forensic tool integration: [FQLite](https://github.com/pawlaszczyk/fqlite)
+ - Schema sources: iLEAPP, ALEAPP, Autopsy (used under their respective open-source licenses)

 ## Additional Resources

+ - **Dataset (Zenodo):** [SQLiteDS — DOI to be added on publication]
+ - **Dataset (Hugging Face):** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
+ - **FQLite integration:** [github.com/pawlaszczyk/fqlite](https://github.com/pawlaszczyk/fqlite)
+ - **Paper:** FSIDI-D-26-00029 (under review)

 ---

+ **Disclaimer:** ForSQLiteLM is intended for research and forensic practitioner use.
+ All generated SQL queries must be reviewed by a qualified practitioner before
+ execution in live forensic investigations. The authors accept no liability for
+ incorrect conclusions drawn from unvalidated model outputs.