---
language: en
license: cc-by-nc-4.0
tags:
- cybersecurity
- malware-analysis
- att&ck
- threat-intelligence
- mixtral
- lora
- peft
- expert-adapters
- cape-sandbox
- digital-forensics
library_name: peft
base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
inference: false
metrics:
- accuracy
---

# **Fathom** — Specialized Cybersecurity Analysis Model

**Mixtral-8x7B-Instruct-v0.1 + 10× LoRA adapters (rank=32, bf16)**
**Primary adapter:** `unified-v2` (general cybersecurity + malware analysis)
**9 expert adapters** for domain-specific routing (static/dynamic analysis, network, forensics, threat intel, etc.)

**Fathom** turns raw sandbox reports (CAPE, Joe Sandbox, etc.) into high-quality ATT&CK-mapped malware analysis. It outperforms general-purpose models on cybersecurity tasks while remaining fully open-source and runnable on a single AMD MI300X / A100 80GB.

---

## Model Overview

- **Base:** Mixtral-8x7B-Instruct-v0.1 (full bf16, no quantization)
- **Training:** Direct fine-tuning with PEFT + TRL
- **Adapters:** 1 unified + 9 expert LoRA adapters (all rank=32, α=16)
- **Hardware:** AMD MI300X (205.8 GB VRAM) — full bf16 training
- **Key innovation:** Evidence extraction layer + structured behavioral prompts → **9× improvement** in real ATT&CK mapping

**Designed for:**

- Malware analysts & threat hunters
- SOC / DFIR teams
- CAPE / sandbox report enrichment
- Automated ATT&CK technique extraction

---

## Benchmark Results

All results use the **real Fathom pipeline** (`[INST]` chat template + 8192-token context + structured evidence from CAPE extraction layer v3). Greedy decoding, bf16.

### 1. General Cybersecurity Knowledge (vs. Closed & Open Models)

| Benchmark | Fathom unified-v2 | GPT-4 (ref) | GPT-3.5 (ref) | Base Mixtral-8x7B | Llama-2-70B (ref) |
|----------------------------|-------------------|-------------|---------------|-------------------|-------------------|
| **CyberMetric-80** | **91.25%** | ~87% | ~67% | 82.5% | ~57% |
| MMLU Computer Security | **79.0%** | ~82% | ~65% | — | ~54% |
| MMLU Security Studies | **64.0%** | ~74% | ~60% | — | ~48% |
| TruthfulQA MC1 | **65.0%** | — | — | — | — |

**Visual bar comparison (CyberMetric-80):**

```
Fathom unified-v2 ████████████████████ 91.25%
GPT-4             ██████████████████   ~87%
Base Mixtral      █████████████████    82.5%
GPT-3.5           ██████████████       ~67%
Llama-2-70B       ████████████         ~57%
```

### 2. Expert Adapter Comparison (CyberMetric-80)

| Adapter | Score | Specialty |
|--------------------------|------------|------------------------------------|
| `unified-v2` | **91.25%** | All-domain baseline |
| `expert-e8-analyst` | **91.25%** | Analyst Q&A & reporting |
| `expert-e3-network` | 90.00% | Network traffic / C2 analysis |
| `expert-e4-forensics` | 90.00% | Memory & disk forensics |
| `expert-e6-detection` | 88.75% | Detection engineering |
| `expert-e7-reports` | 88.75% | Structured report generation |
| `expert-e9-cot` | 87.50% | Chain-of-thought reasoning |
| `expert-e2-dynamic` | 85.00% | Behavioral / sandbox analysis |
| `expert-e1-static` | 83.75% | Static PE + evasion detection |
| `expert-e5-threatintel` | 81.25% | Threat intel & actor profiling |

### 3. Core Contribution: Real ATT&CK Mapping Accuracy

**Progression table** (same model weights; only the input pipeline improved):

| Configuration | Exact F1 | Parent F1 | Δ Parent F1 vs. naive |
|----------------------------------------|----------|-----------|-----------------------|
| Raw API list (naive) | 0.083 | 0.095 | — |
| Structured prompt (manual) | 0.370 | 0.429 | +0.334 |
| Real Fathom evidence layer | 0.534 | 0.508 | +0.413 |
| **Real pipeline + full context fix** | **0.868**| **0.841** | **+0.746** |

**This shows that the architecture (evidence extraction + structured prompts) matters more than additional fine-tuning.**

### 4. Real Malware Analysis — CAPE Pipeline (malscore 10/10 samples)

| Sample | Family | GT T-codes | Predicted T-codes | Exact F1 | Parent F1 | Family ID |
|--------|----------|-----------------------------|--------------------------------------------|----------|-----------|-----------|
| 12 | Emotet | T1012, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | 0.889 | 0.857 | 100% conf |
| 15 | Formbook | T1012, T1055, T1071, T1071.004, T1083 | T1003, T1012, T1027.002, T1055, T1059, T1071, T1071.004, T1083, T1497 | 0.714 | 0.667 | 85% conf |
| 16 | Dridex | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | **1.000**| **1.000** | 68% conf |
| **Average** | | | | **0.868**| **0.841** | — |

### 5. Additional Benchmarks

- **ATT&CK Mapping MCQ (30 handcrafted questions):** 80%
- **MMLU Machine Learning:** 60%
- **MMLU Electrical Engineering:** 64%
- **Rigorous ground-truth F1 (23 test cases):** Exact = 0.184, Parent = 0.344 (synthetic); real CAPE = 0.841 after pipeline fixes

### 6. Key Discovery: Mal-API-2019 Analysis

We evaluated Fathom on the public **Mal-API-2019** dataset (Catak & Yazı, arXiv:1905.01999) — 7,107 API call sequences from Cuckoo Sandbox.
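The Exact vs. Parent F1 scores reported above can be reproduced with a short sketch: both are set-based F1 over predicted vs. ground-truth ATT&CK technique IDs, with the "parent" variant first collapsing sub-techniques (e.g. `T1071.004` → `T1071`). The helper names below are illustrative, not taken from the Fathom codebase; the sample values are the Emotet row from the table above.

```python
def parent(technique_id: str) -> str:
    """Collapse a sub-technique ID (e.g. 'T1071.004') to its parent ('T1071')."""
    return technique_id.split(".")[0]

def set_f1(predicted, ground_truth) -> float:
    """Set-based F1 between predicted and ground-truth technique IDs."""
    p, g = set(predicted), set(ground_truth)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

# Sample 12 (Emotet) from the CAPE pipeline table:
gt   = ["T1012", "T1071", "T1071.004", "T1083"]
pred = ["T1012", "T1055", "T1071", "T1071.004", "T1083"]

exact     = set_f1(pred, gt)                                          # ≈ 0.889
parent_f1 = set_f1([parent(t) for t in pred], [parent(t) for t in gt])  # ≈ 0.857
```

The parent variant is more forgiving by design: a prediction of `T1071` still scores when the ground truth lists `T1071.004`, which is why Parent F1 is the headline production metric.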
| Variant | Accuracy | Macro F1 |
|-----------------------------|----------|----------|
| Raw API sequences | 12.6% | 0.030 |
| Filtered behavioral groups | 10.9% | 0.052 |

**Insight:** Raw API sequences alone are insufficient for reliable family classification: the dataset contains heavy loader noise, families share nearly identical behavioral APIs, and ground-truth labels come from static AV signatures rather than behavioral semantics.

> In contrast, Fathom's full evidence extraction pipeline achieves 0.841 Parent F1 on real CAPEv2 reports. This demonstrates that structured behavioral evidence + multi-source context (not raw API text) is the critical enabler for production-grade malware analysis.

---

## How to Use

### Loading the unified model (recommended for most users)

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
adapter = "umer07/fathom-mixtral"  # unified-v2 at repo root

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Full-precision bf16 base model (no quantization), sharded across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Attach the unified-v2 LoRA adapter on top of the frozen base weights.
model = PeftModel.from_pretrained(model, adapter, adapter_name="unified-v2")
model.eval()
```

---

## Limitations

- Sub-technique precision is lower than for parent techniques (a standard limitation across LLMs)
- Family identification improves significantly with KSPN enrichment
- Rare/exotic TTPs (UAC bypass, ICMP C2) have low recall
- Prompt injection / attribution hallucination remains a base-model weakness (mitigable with system-prompt hardening)

---

## Training & Datasets

- **Unified-v2:** 123,912 rows (1 epoch)
- **Experts:** 9 specialized datasets (> 200k rows total after augmentation)
- **Evasive dataset (new):** 25,160 obfuscated C++ samples (92 evasion combinations)
- **ThreatIntel upgrade:** 9,532 rows (URLhaus + GTFOBins + MITRE CTI)

---

## Citation

```bibtex
@misc{fathom2026,
  title={Fathom: Expert Cybersecurity
Analysis with Mixtral LoRA Adapters},
  author={Umer},
  year={2026},
  howpublished={\url{https://huggingface.co/umer07/fathom-mixtral}},
}
```
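As a closing illustration of the "structured behavioral evidence, not raw API text" idea behind the benchmark gains above, here is a minimal sketch of assembling CAPE-style report fields into an `[INST]` prompt. All field names, section headers, and the task wording are hypothetical assumptions for illustration — they are **not** the actual schema of Fathom's CAPE extraction layer v3.

```python
def build_evidence_prompt(report: dict) -> str:
    """Assemble selected sandbox-report fields into a structured [INST] prompt.

    The keys 'signatures', 'network', and 'registry' are illustrative placeholders,
    not the real extraction-layer schema.
    """
    sections = []
    if report.get("signatures"):
        sections.append("## Triggered signatures\n" +
                        "\n".join(f"- {s}" for s in report["signatures"]))
    if report.get("network"):
        sections.append("## Network indicators\n" +
                        "\n".join(f"- {n}" for n in report["network"]))
    if report.get("registry"):
        sections.append("## Registry activity\n" +
                        "\n".join(f"- {r}" for r in report["registry"]))
    evidence = "\n\n".join(sections)
    task = "Map the observed behavior to MITRE ATT&CK technique IDs and justify each mapping."
    return f"[INST] {evidence}\n\n{task} [/INST]"

prompt = build_evidence_prompt({
    "signatures": ["Allocates RWX memory in a remote process (possible injection)"],
    "network": ["HTTPS beacon to 203.0.113.7:443 every 60s"],
    "registry": ["Enumerates HKLM\\SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Run"],
})
```

The point of the sketch is the grouping itself: presenting evidence as labeled behavioral sections (rather than a flat API-call dump) is what the progression table in section 3 credits with most of the F1 gain.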