pawlaszc committed · Commit d9ad01f (verified) · Parent(s): f83cd2d

Upload README.md

Files changed (1): README.md (+255 -188)
README.md CHANGED
@@ -1,14 +1,15 @@
  ---
- license: apache-2.0
  language:
  - en
- base_model:
- - unsloth/Llama-3.2-3B-Instruct
  tags:
- - digital
  - forensics
- - sqlite
-
  datasets:
  - pawlaszc/mobile-forensics-sql
  metrics:
@@ -21,77 +22,141 @@ model-index:
  name: Text-to-SQL Generation
  dataset:
  type: mobile-forensics
- name: Mobile Forensics SQL Dataset
  metrics:
  - type: accuracy
- value: 79.0
- name: Overall Accuracy
  - type: accuracy
- value: 94.3
  name: Easy Queries Accuracy
  - type: accuracy
- value: 80.6
  name: Medium Queries Accuracy
  - type: accuracy
- value: 61.8
  name: Hard Queries Accuracy
-
-
  ---

  # ForensicSQL-Llama-3.2-3B

  ## Model Description

- **ForensicSQL** is a fine-tuned Llama 3.2 3B model specialised for generating SQLite queries for mobile forensics databases. The model converts natural language forensic investigation requests into executable SQL queries across various mobile app databases (WhatsApp, Signal, iOS Health, Android SMS, etc.).

- This model was developed as part of a research project investigating LLM fine-tuning for forensic database analysis.

  ## Model Details

- - **Base Model:** Llama 3.2 3B Instruct
- - **Fine-tuning Method:** LoRA (Low-Rank Adaptation)
- - **Training Dataset:** 768 forensic SQL examples across 148 categories
- - **Training Framework:** Hugging Face Transformers + PEFT
- - **Model Size:**
-   - Full (FP16): ~6 GB
-   - GGUF Q4_K_M: ~2.3 GB
-   - GGUF Q5_K_M: ~2.8 GB
-   - GGUF Q8_0: ~3.8 GB

  ## Performance

- ### Overall Results
- - **Overall Accuracy:** 79.0%
- - **Schema Generation Errors:** 0% (completely eliminated)
- - **Executable Queries:** 79%
-
- ### Breakdown by Difficulty
- | Difficulty              | Accuracy | Examples |
- |-------------------------|----------|----------|
- | Easy (single-table)     | 94.3%    | 33/35    |
- | Medium (simple joins)   | 80.6%    | 25/31    |
- | Hard (complex queries)  | 61.8%    | 21/34    |
-
- ### Error Analysis
- | Error Type           | Percentage | Description                        |
- |----------------------|------------|------------------------------------|
- | Column Hallucination | 18%        | References non-existent columns    |
- | Syntax Errors        | 3%         | Invalid SQL syntax                 |
- | Schema Generation    | 0%         | Eliminated through proper training |

  ## Intended Use

  ### Primary Use Cases
- - Mobile forensics investigations
- - Automated SQL query generation for forensic databases
- - Educational tool for learning forensic database analysis
- - Research in text-to-SQL for specialized domains

  ### Out-of-Scope Use
- - General-purpose SQL generation (use specialized models)
- - Production systems requiring >95% accuracy
- - Real-time critical forensic decisions without human review

  ## How to Use

@@ -101,28 +166,34 @@ This model was developed as part of a research project investigating LLM fine-tu
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import torch

- # Load model and tokenizer
  model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
-     torch_dtype=torch.float16,
      device_map="auto"
  )

- # Prepare input
  schema = """
- CREATE TABLE messages (
-     _id INTEGER PRIMARY KEY,
-     address TEXT,
-     body TEXT,
      date INTEGER,
-     read INTEGER
  );
  """

- request = "Find all unread messages from yesterday"

  prompt = f"""Generate a valid SQLite query for this forensic database request.

  Database Schema:
@@ -133,204 +204,200 @@ Request: {request}
  SQLite Query:
  """

- # Generate SQL
  inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
  inputs = {k: v.to(model.device) for k, v in inputs.items()}

  with torch.no_grad():
      outputs = model.generate(
          **inputs,
-         max_new_tokens=200,
-         do_sample=False,
      )

- # Decode only the generated part
  input_length = inputs['input_ids'].shape[1]
- generated_tokens = outputs[0][input_length:]
- sql = tokenizer.decode(generated_tokens, skip_special_tokens=True)
-
  print(sql.strip())
- # Output: SELECT * FROM messages WHERE read = 0 AND date > ...
- ```
-
- ### Using GGUF Files (llama.cpp / Ollama)
-
- **With llama.cpp:**
- ```bash
- # Download GGUF file
- wget https://huggingface.co/pawlaszc/ForensicSQL-Llama-3.2-3B/resolve/main/forensic-sql-q4_k_m.gguf
-
- # Run inference
- ./llama-cli -m forensic-sql-q4_k_m.gguf -p "Generate SQL..."
  ```

- **With Ollama:**
- ```bash
- # Create Modelfile
- FROM ./forensic-sql-q4_k_m.gguf
- PARAMETER temperature 0
- PARAMETER top_p 0.9
-
- # Import
- ollama create forensic-sql -f Modelfile
-
- # Use
- ollama run forensic-sql "Schema: ...\nRequest: Find messages\nSQL:"
- ```

  ### Python Helper Class

  ```python
  class ForensicSQLGenerator:
-     def __init__(self, model_name="pawalaszc/ForensicSQL-Llama-3.2-3B"):
          from transformers import AutoModelForCausalLM, AutoTokenizer
          import torch
-
          self.tokenizer = AutoTokenizer.from_pretrained(model_name)
          self.model = AutoModelForCausalLM.from_pretrained(
              model_name,
-             torch_dtype=torch.float16,
              device_map="auto"
          )
          self.model.eval()
-
-     def generate_sql(self, schema: str, request: str) -> str:
-         prompt = f"""Generate a valid SQLite query for this forensic database request.
-
- Database Schema:
- {schema}

- Request: {request}
-
- SQLite Query:
- """
          inputs = self.tokenizer(
-             prompt,
-             return_tensors="pt",
-             truncation=True,
-             max_length=2048
          )
          inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
-
-         input_length = inputs['input_ids'].shape[1]
-
          with torch.no_grad():
              outputs = self.model.generate(
-                 **inputs,
-                 max_new_tokens=200,
-                 do_sample=False,
              )
-
-         generated_tokens = outputs[0][input_length:]
-         sql = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
-
          return sql.strip().split("\n")[0].strip().rstrip(";") + ";"

  # Usage
  generator = ForensicSQLGenerator()
- sql = generator.generate_sql(schema, request)
  ```
 
- ## Training Details

- ### Training Data
- - **Size:** 768 examples (original), 2,304 examples (with augmentation)
- - **Categories:** 148 forensic database categories
- - **Sources:** WhatsApp, Signal, iMessage, SMS, iOS apps, Android apps
- - **Augmentation:** 3x paraphrase augmentation per example

- ### Training Procedure
- - **Method:** LoRA fine-tuning
- - **LoRA Rank:** 16
- - **LoRA Alpha:** 32
- - **Target Modules:** q_proj, k_proj, v_proj, o_proj
- - **Epochs:** 5
- - **Learning Rate:** 2e-5
- - **Batch Size:** 1 (gradient accumulation: 4)
- - **Max Sequence Length:** 2048 (critical for preventing truncation)
- - **Optimizer:** AdamW
- - **Scheduler:** Cosine with warmup (10%)
- - **Hardware:** Apple M2 (MPS)
- - **Training Time:** ~3.5 hours

- ### Key Training Insights

- **Critical Discovery: Sequence Length Matters**

- Initial training attempts with `max_seq_length=512` resulted in only 50% accuracy because 92% of training examples were truncated. The model learned to generate schema definitions (CREATE TABLE) instead of queries.

- Increasing to `max_seq_length=2048` eliminated truncation and improved accuracy from 50% to 79% (+29 pp).

- **Lesson:** Data preprocessing and proper sequence length configuration are critical for fine-tuning success.
 
  ## Limitations

  ### Known Issues
- 1. **Column Hallucination (18%):** The model sometimes references non-existent columns
- 2. **Complex Joins:** Performance drops on multi-table queries requiring JOINs (62%)
- 3. **Schema Understanding:** Limited understanding of foreign key relationships

- ### When to Use Human Review
- - Complex multi-table queries
- - Critical forensic investigations
- - Queries involving data deletion or modification
- - When accuracy >95% is required

  ## Evaluation

- ### Test Set
- - **Size:** 100 queries (random sample from held-out data)
- - **Seed:** 42 (reproducible)
- - **Evaluation Metric:** Exact match (query results must match expected results)
-
- ### Ablation Studies
-
- | Configuration                  | Accuracy | Notes          |
- |--------------------------------|----------|----------------|
- | Zero-shot baseline             | 45%      | No fine-tuning |
- | Final training (max_len=2048)  | 79%      | No truncation  |
-

  ## Citation

- If you use this model in your research, please cite:

  ```bibtex
- @misc{dirk_pawlaszczyk_2026,
-   author    = {Dirk Pawlaszczyk and Ronny Bodach and Christian Hummert and Philipp Engler},
-   title     = {DigitalForensicsText2SQLite},
-   year      = {2026},
-   url       = {https://huggingface.co/pawlaszc/DigitalForensicsText2SQLite},
-   doi       = {10.57967/hf/7675},
-   publisher = {Hugging Face}
  }
  ```

- ## Model Card Authors
-
- Dirk Pawlaszczyk
-
- ## Model Card Contact
-
- For questions or issues, please open an issue on [GitHub](https://github.com/pawlaszczyk/fqlite) or contact pawlaszc@hs-mittweida.de.
-
  ## License

- This model is released under the Apache 2.0 License, following the base Llama 3.2 license.

  ## Acknowledgments

- - Base model: Meta's Llama 3.2 3B Instruct
- - Training framework: Hugging Face Transformers, PEFT
- - Dataset creation: Custom forensic database schemas
- - Inspiration: Text-to-SQL research community

  ## Additional Resources

- - **Dataset:** pawlaszc/mobile-forensics-sql
- - **GitHub:** https://github.com/pawlaszczyk/fqlite
- - **Paper:** [Link when published]

  ---

- **Disclaimer:** This model is intended for research and educational purposes. Always validate generated SQL queries before execution in production forensic investigations. The model may produce incorrect queries that could lead to data loss or incorrect conclusions if used without proper review.

  ---
  language:
  - en
+ license: apache-2.0
+ library_name: transformers
  tags:
+ - sql
  - forensics
+ - text-to-sql
+ - llama
+ - fine-tuned
+ base_model: unsloth/Llama-3.2-3B-Instruct
  datasets:
  - pawlaszc/mobile-forensics-sql
  metrics:

  name: Text-to-SQL Generation
  dataset:
  type: mobile-forensics
+ name: SQLiteDS — Mobile Forensics SQL Dataset (corrected)
  metrics:
  - type: accuracy
+ value: 91.0
+ name: Overall Accuracy (without app name)
  - type: accuracy
+ value: 95.1
  name: Easy Queries Accuracy
  - type: accuracy
+ value: 87.5
  name: Medium Queries Accuracy
  - type: accuracy
+ value: 88.9
  name: Hard Queries Accuracy

  ---

  # ForensicSQL-Llama-3.2-3B

  ## Model Description

+ **ForSQLiteLM** (ForensicSQL-Llama-3.2-3B) is a fine-tuned Llama 3.2-3B model specialized
+ for generating SQLite queries from natural language requests against mobile forensic databases.
+ The model converts investigative questions into executable SQL queries across a wide range of
+ forensic artifact databases — WhatsApp, Signal, iMessage, Android SMS, iOS Health, WeChat,
+ Instagram, blockchain wallets, and many more.

+ This model was developed as part of a master's thesis and accompanying journal paper
+ investigating LLM fine-tuning for forensic database analysis, and is integrated into
+ [FQLite](https://github.com/pawlaszczyk/fqlite), an established open-source forensic
+ analysis tool.
+
+ > **Key result:** 91.0% execution accuracy on a 100-example held-out test set — within
+ > 4 percentage points of GPT-4o (95.0%) evaluated under identical conditions
+ > (McNemar test: p ≈ 0.39, not significant at α = 0.05), while running fully locally
+ > with no internet connectivity required.

  ## Model Details

+ | Property | Value |
+ |---|---|
+ | **Base Model** | meta-llama/Llama-3.2-3B-Instruct |
+ | **Fine-tuning Method** | Full fine-tune (bf16) |
+ | **Training Dataset** | SQLiteDS — 800 training examples, 191 forensic artifact categories |
+ | **Training Framework** | Hugging Face Transformers |
+ | **Best Val Loss** | 0.3043 (7 epochs) |
+ | **Model Size (bf16)** | ~6 GB |
+ | **Hardware Required** | 16 GB unified memory (Apple M-series) or equivalent GPU |

  ## Performance

+ ### Overall Results (fixed dataset, n=100, best configuration)
+
+ | Metric | Value |
+ |---|---|
+ | **Overall Accuracy** | **91.0%** (91/100) |
+ | 95% CI (Wilson) | [83.8%, 95.2%] |
+ | Executable Queries | 92/100 |
+ | GPT-4o Accuracy | 95.0% (gap: 4 pp, p ≈ 0.39) |
+ | Base Model (no fine-tuning) | 35.0% |
+ | Improvement over base | +56 pp |
+
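+ The Wilson interval above can be reproduced in a few lines of Python. A minimal
+ sketch using the standard closed-form (illustrative, not taken from the
+ evaluation script):
+
+ ```python
+ import math
+
+ def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple:
+     """Wilson score interval for a binomial proportion (95% for z=1.96)."""
+     p = successes / n
+     denom = 1 + z**2 / n
+     center = (p + z**2 / (2 * n)) / denom
+     half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
+     return center - half, center + half
+
+ print(wilson_ci(91, 100))  # (0.8377..., 0.9519...) -> reported as [83.8%, 95.2%]
+ ```
+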
+ ### Accuracy by Query Difficulty
+
+ | Difficulty | Accuracy | n | 95% CI | vs. GPT-4o |
+ |---|---|---|---|---|
+ | Easy (single-table) | **95.1%** | 39/41 | [83.9%, 98.7%] | 0.0 pp |
+ | Medium (joins, aggregation) | **87.5%** | 28/32 | [71.9%, 95.0%] | 0.0 pp |
+ | Hard (CTEs, window functions) | **88.9%** | 24/27 | [71.9%, 96.1%] | −3.7 pp |
+
+ ForSQLiteLM matches GPT-4o exactly on Easy and Medium queries. The remaining gap
+ is concentrated on Hard queries (complex CTEs, window functions, multi-table joins).
+
+ ### Accuracy by Forensic Domain
+
+ | Domain | Accuracy | n | 95% CI |
+ |---|---|---|---|
+ | Messaging & Social | **100.0%** | 28/28 | [87.9%, 100.0%] |
+ | Android Artifacts | **94.4%** | 17/18 | [74.2%, 99.0%] |
+ | Productivity & Other | **88.9%** | 16/18 | [67.2%, 96.9%] |
+ | iOS CoreData | **84.0%** | 21/25 | [65.3%, 93.6%] |
+ | Finance & Crypto | **81.8%** | 9/11 | [52.3%, 94.9%] |
+
+ ### Prompt Configuration Ablation
+
+ | Configuration | Overall | Easy | Medium | Hard | iOS |
+ |---|---|---|---|---|---|
+ | **WITHOUT App Name** ★ | **91.0%** | **95.1%** | 87.5% | **88.9%** | 84.0% |
+ | WITH App Name | 88.0% | 92.7% | 87.5% | 81.5% | **88.0%** |
+
+ ★ Primary configuration — omitting the application name from the prompt yields
+ 3 pp higher overall accuracy. Interestingly, including the app name helps iOS
+ CoreData schemas (+4 pp) but hurts Hard queries (−7.4 pp); the configuration
+ without the app name is recommended for general use.
+
+ ### Post-Processing Pipeline Contribution
+
+ | Component | Queries saved |
+ |---|---|
+ | Execution feedback (retry) | 7 |
+ | Alias normalization | 18 |
+ | Column corrections (Levenshtein) | 2 |
+
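+ To illustrate the column-correction step, a minimal Levenshtein-repair sketch
+ (illustrative only; helper names such as `correct_columns` are not from the
+ released pipeline):
+
+ ```python
+ def levenshtein(a: str, b: str) -> int:
+     """Classic dynamic-programming edit distance."""
+     prev = list(range(len(b) + 1))
+     for i, ca in enumerate(a, 1):
+         curr = [i]
+         for j, cb in enumerate(b, 1):
+             curr.append(min(prev[j] + 1,                 # deletion
+                             curr[j - 1] + 1,             # insertion
+                             prev[j - 1] + (ca != cb)))   # substitution
+         prev = curr
+     return prev[-1]
+
+ def correct_columns(referenced, schema_columns, max_dist=2):
+     """Map hallucinated column names to the nearest real schema column."""
+     fixes = {}
+     for col in referenced:
+         if col in schema_columns:
+             continue
+         best = min(schema_columns, key=lambda c: levenshtein(col.lower(), c.lower()))
+         if levenshtein(col.lower(), best.lower()) <= max_dist:
+             fixes[col] = best
+     return fixes
+
+ print(correct_columns(["adress", "body"], ["address", "body", "date"]))
+ # {'adress': 'address'}
+ ```
+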
+ ### Training Progression
+
+ | Configuration | Val Loss | Accuracy | Δ |
+ |---|---|---|---|
+ | Base model (no fine-tuning) | — | 35.0% | — |
+ | Fine-tuned, no augmentation | — | 68.0% | +33 pp |
+ | + Data augmentation (3.4×) | — | 74.0% | +6 pp |
+ | + Extended training (7 epochs) | 0.3617 | 84.0% | +10 pp |
+ | + Post-processing pipeline | 0.3617 | 87.0% | +3 pp |
+ | + Execution feedback | 0.3617 | 90.0% | +3 pp |
+ | + Corrected training dataset (v5) | **0.3043** | **91.0%** | +1 pp |

  ## Intended Use

  ### Primary Use Cases
+ - Mobile forensics investigations: automated SQL query drafting against seized device databases
+ - Integration into forensic tools (FQLite, Autopsy, ALEAPP/iLEAPP workflows)
+ - Research in domain-specific Text-to-SQL
+ - Educational use for learning forensic database analysis
+
+ ### Important: This Model Is a Drafting Assistant
+
+ > **ForSQLiteLM is not a replacement for SQL expertise.** It generates candidate
+ > queries that must be reviewed by a practitioner with sufficient SQL knowledge
+ > before any reliance is placed on their results. At 91.0% accuracy, roughly
+ > **1 in 11 queries contains an error**. In court-admissible or case-critical
+ > work, all outputs must be independently validated.

  ### Out-of-Scope Use
+ - Autonomous forensic decision-making without human review
+ - Production systems requiring >95% guaranteed accuracy
+ - General-purpose SQL generation outside the forensic domain
+ - Non-SQLite databases (PostgreSQL, MySQL, etc.)

  ## How to Use

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  import torch

  model_name = "pawlaszc/ForensicSQL-Llama-3.2-3B"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
+     torch_dtype=torch.bfloat16,
      device_map="auto"
  )
+ model.eval()

  schema = """
+ CREATE TABLE message (
+     ROWID INTEGER PRIMARY KEY,
+     text TEXT,
+     handle_id INTEGER,
      date INTEGER,
+     is_from_me INTEGER,
+     cache_has_attachments INTEGER
+ );
+ CREATE TABLE handle (
+     ROWID INTEGER PRIMARY KEY,
+     id TEXT,
+     service TEXT
  );
  """

+ request = "Find all messages received in the last 7 days that contain attachments"

+ # Note: do NOT use apply_chat_template — use the plain-text prompt format
  prompt = f"""Generate a valid SQLite query for this forensic database request.

  Database Schema:
  {schema}

  Request: {request}

  SQLite Query:
  """

  inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
  inputs = {k: v.to(model.device) for k, v in inputs.items()}

  with torch.no_grad():
      outputs = model.generate(
          **inputs,
+         max_new_tokens=300,
+         do_sample=False,  # greedy decoding — do not change
      )

  input_length = inputs['input_ids'].shape[1]
+ sql = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True)
  print(sql.strip())
  ```

+ > **Important:** Use plain-text tokenization (do **not** call `apply_chat_template`).
+ > The model was trained and evaluated with a plain-text prompt format.
+ > Use `do_sample=False` (greedy decoding) for reproducible results.

  ### Python Helper Class

  ```python
  class ForensicSQLGenerator:
+     def __init__(self, model_name="pawlaszc/ForensicSQL-Llama-3.2-3B"):
          from transformers import AutoModelForCausalLM, AutoTokenizer
          import torch
+
          self.tokenizer = AutoTokenizer.from_pretrained(model_name)
          self.model = AutoModelForCausalLM.from_pretrained(
              model_name,
+             torch_dtype=torch.bfloat16,
              device_map="auto"
          )
          self.model.eval()

+     def generate_sql(self, schema: str, request: str) -> str:
+         prompt = (
+             "Generate a valid SQLite query for this forensic database request.\n\n"
+             f"Database Schema:\n{schema}\n\n"
+             f"Request: {request}\n\n"
+             "SQLite Query:\n"
+         )
          inputs = self.tokenizer(
+             prompt, return_tensors="pt", truncation=True, max_length=2048
          )
          inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
+         input_length = inputs["input_ids"].shape[1]
+
          with torch.no_grad():
              outputs = self.model.generate(
+                 **inputs, max_new_tokens=300, do_sample=False
              )
+
+         sql = self.tokenizer.decode(
+             outputs[0][input_length:], skip_special_tokens=True
+         )
+         # Return first statement only, normalized
          return sql.strip().split("\n")[0].strip().rstrip(";") + ";"

  # Usage
  generator = ForensicSQLGenerator()
+ sql = generator.generate_sql(schema, "Find all unread messages from the last 24 hours")
+ print(sql)
  ```

+ ### With Ollama / llama.cpp (GGUF)
+
+ ```bash
+ # With llama.cpp
+ ./llama-cli -m forensic-sql-q4_k_m.gguf \
+   --temp 0 \
+   -p "Generate a valid SQLite query for this forensic database request.
+
+ Database Schema:
+ CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER);
+
+ Request: Find all messages sent after midnight
+
+ SQLite Query:"
+
+ # With Ollama, create a Modelfile
+ cat > Modelfile << 'EOF'
+ FROM ./forensic-sql-q4_k_m.gguf
+ PARAMETER temperature 0
+ PARAMETER num_predict 300
+ EOF
+
+ ollama create forensic-sql -f Modelfile
+ ollama run forensic-sql
+ ```

+ ## Training Details

+ ### Dataset: SQLiteDS
+
+ - **Total examples:** 1,000 (800 train / 100 val / 100 test), fixed random seed 42
+ - **Forensic artifact categories:** 191
+ - **Reference query validation:** All 1,000 reference queries validated for execution
+   correctness against in-memory SQLite; 50 queries (5%) corrected before final training
+   (see the sketch after this list)
+ - **Augmentation:** 3.4× expansion via instruction paraphrasing, WHERE clause reordering,
+   and LIMIT injection — augmented examples confined to the training split only
+ - **Dataset:** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
+ - **License:** CC BY 4.0
+
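+ A minimal sketch of this style of in-memory validation (illustrative; the
+ released validation script may differ):
+
+ ```python
+ import sqlite3
+
+ def query_executes(schema: str, query: str) -> bool:
+     """Check that a reference query runs without error against its schema."""
+     con = sqlite3.connect(":memory:")
+     try:
+         con.executescript(schema)   # build empty tables from CREATE statements
+         con.execute(query).fetchall()
+         return True
+     except sqlite3.Error:
+         return False
+     finally:
+         con.close()
+
+ schema = "CREATE TABLE sms (_id INTEGER PRIMARY KEY, address TEXT, body TEXT, date INTEGER);"
+ print(query_executes(schema, "SELECT address, body FROM sms WHERE date > 0;"))  # True
+ print(query_executes(schema, "SELECT adress FROM sms;"))                        # False
+ ```
+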
+ ### Hyperparameters
+
+ | Parameter | Value |
+ |---|---|
+ | Training method | Full fine-tune (no LoRA) |
+ | Precision | bfloat16 |
+ | Epochs | 7 |
+ | Learning rate | 2e-5 (peak) |
+ | LR scheduler | Cosine with warmup |
+ | Batch size | 1 (gradient accumulation: 4) |
+ | Max sequence length | 2048 |
+ | Optimizer | AdamW |
+ | Hardware | Apple M-series, 16 GB unified memory |
+ | Training time | ~17.6 hours |
+ | Best val loss | 0.3043 (epoch 7) |
+
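+ For orientation, these settings map onto a standard `transformers`
+ `TrainingArguments` roughly as below. This is a sketch, not the published
+ training script; `output_dir` is a placeholder and `warmup_ratio=0.1` is an
+ assumption carried over from the earlier LoRA runs' 10% warmup:
+
+ ```python
+ from transformers import TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="forensic-sql-ft",    # hypothetical path
+     num_train_epochs=7,
+     learning_rate=2e-5,              # peak LR
+     lr_scheduler_type="cosine",
+     warmup_ratio=0.1,                # ASSUMPTION: warmup fraction not stated for this run
+     per_device_train_batch_size=1,
+     gradient_accumulation_steps=4,   # effective batch size 4
+     bf16=True,                       # bfloat16 precision
+     optim="adamw_torch",
+ )
+ # max_seq_length=2048 is enforced at tokenization time, not in TrainingArguments.
+ ```
+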
+ ### Key Training Insight: Sequence Length
+
+ Early training runs with `max_seq_length=512` truncated 92% of examples, causing
+ the model to learn schema generation (CREATE TABLE) instead of queries — resulting
+ in only ~50% accuracy. Setting `max_seq_length=2048` eliminated truncation and
+ improved accuracy from 50% to 68% before augmentation, and to 91% after all
+ training components were applied.
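+
+ A quick way to catch this failure mode before training is to measure the
+ truncation rate of the formatted examples directly; a minimal sketch (not the
+ project's preprocessing code):
+
+ ```python
+ from transformers import AutoTokenizer
+
+ def truncation_rate(texts, tokenizer, max_len=512):
+     """Fraction of formatted training examples longer than max_len tokens."""
+     too_long = sum(
+         len(tokenizer(t, truncation=False)["input_ids"]) > max_len for t in texts
+     )
+     return too_long / len(texts)
+
+ tok = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct")
+ texts = ["...schema + request + reference SQL..."]  # formatted training strings
+ print(f"{truncation_rate(texts, tok, max_len=512):.1%} would be truncated")
+ ```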

  ## Limitations

  ### Known Issues

+ 1. **iOS CoreData schemas (84.0%):** The Z-prefix column naming convention
+    (e.g., `ZISFROMME`, `ZTIMESTAMP`) provides no semantic signal from column
+    names alone, making these schemas harder to reason about (see the example
+    after this list).
+ 2. **Hard queries (3.7 pp gap to GPT-4o):** Complex CTEs, recursive queries,
+    and window functions are the primary remaining challenge.
+ 3. **Finance & Crypto (81.8%, n=11):** Small test set; confidence intervals are
+    wide. Interpret with caution.
+ 4. **~1 in 11 error rate:** Approximately 9% of generated queries will contain
+    errors. Expert review of all outputs is required before use in investigations.
+
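+ For illustration, a hypothetical CoreData-style table (a composite modeled on
+ common iOS app schemas, not taken from the test set) shows why these names are
+ hard to interpret:
+
+ ```sql
+ -- Hypothetical CoreData-style schema: column names are machine-generated
+ -- and carry almost no natural-language meaning.
+ CREATE TABLE ZMESSAGE (
+     Z_PK INTEGER PRIMARY KEY,   -- Core Data's opaque primary key
+     Z_ENT INTEGER,              -- entity identifier
+     Z_OPT INTEGER,              -- optimistic-locking counter
+     ZISFROMME INTEGER,          -- 1 if sent by the device owner
+     ZTIMESTAMP TIMESTAMP,       -- seconds since 2001-01-01 (Core Data epoch)
+     ZTEXT VARCHAR
+ );
+ ```
+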
+ ### When Human Review Is Especially Important
+
+ - Complex multi-table queries with CTEs or window functions
+ - Case-critical or court-admissible investigations
+ - Any query that will be used to draw conclusions about a suspect
+ - Queries involving rare or unusual forensic artifact schemas

  ## Evaluation

+ - **Test set:** 100 examples, held out, seed=42, non-augmented
+ - **Metric:** execution accuracy; a query is correct iff it executes without error
+   AND returns a result set identical to that of the reference query (see the sketch below)
+ - **Reference validation:** All reference queries validated for execution correctness
+   before evaluation; 5 broken queries in the test set were corrected
+ - **Evaluation script:** available in the dataset repository on Zenodo ([DOI])
+
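+ A minimal sketch of such an execution-accuracy check (illustrative; it assumes a
+ populated evidence database file and is not the released evaluation script):
+
+ ```python
+ import sqlite3
+
+ def execution_match(db_path: str, candidate: str, reference: str) -> bool:
+     """A candidate is correct iff it runs AND reproduces the reference result set."""
+     con = sqlite3.connect(db_path)
+     try:
+         ref_rows = con.execute(reference).fetchall()
+         try:
+             cand_rows = con.execute(candidate).fetchall()
+         except sqlite3.Error:
+             return False  # not executable
+         if "order by" in reference.lower():
+             return cand_rows == ref_rows        # order-sensitive comparison
+         return sorted(map(repr, cand_rows)) == sorted(map(repr, ref_rows))
+     finally:
+         con.close()
+ ```
+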
 
365
  ## Citation
366
 
367
+ If you use this model or the SQLiteDS dataset in your research, please cite:
368
 
369
  ```bibtex
370
+ @article{pawlaszczyk2026forsqlitelm,
371
+ author = {Dirk Pawlaszczyk},
372
+ title = {AI-Based Automated SQL Query Generation for SQLite Databases
373
+ in Mobile Forensics},
374
+ journal = {Forensic Science International: Digital Investigation},
375
+ year = {2026},
376
+ note = {FSIDI-D-26-00029}
377
  }
378
  ```
379
 
 
 
 
 
 
 
 
 
380
  ## License
381
 
382
+ Apache 2.0 following the base Llama 3.2 license terms.
383
 
384
  ## Acknowledgments
385
 
386
+ - Base model: Meta's Llama 3.2-3B-Instruct
387
+ - Training framework: Hugging Face Transformers
388
+ - Forensic tool integration: [FQLite](https://github.com/pawlaszczyk/fqlite)
389
+ - Schema sources: iLEAPP, ALEAPP, Autopsy (used under their respective open-source licenses)
390
 
391
  ## Additional Resources
392
 
393
+ - **Dataset (Zenodo):** [SQLiteDS — DOI to be added on publication]
394
+ - **Dataset (HuggingFace):** [pawlaszc/mobile-forensics-sql](https://huggingface.co/datasets/pawlaszc/mobile-forensics-sql)
395
+ - **FQLite integration:** [github.com/pawlaszczyk/fqlite](https://github.com/pawlaszczyk/fqlite)
396
+ - **Paper:** FSIDI-D-26-00029 (under review)
397
 
398
  ---
399
 
400
+ **Disclaimer:** ForSQLiteLM is intended for research and forensic practitioner use.
401
+ All generated SQL queries must be reviewed by a qualified practitioner before
402
+ execution in live forensic investigations. The authors accept no liability for
403
+ incorrect conclusions drawn from unvalidated model outputs.