EMBER2024 Malware Detection Models

A collection of models covering four architectures (DNN, TabNet, Hybrid GBDT2NN, LightGBM), each trained and evaluated on all eight file-type subsets of the EMBER2024 dataset and converted to deployable formats.

Training environment: NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified memory, CUDA 13)
Code: github.com/evan0416/ember-ml
Dataset paper: Joyce et al., KDD 2025 (arXiv:2506.05074)


๋ชจ๋ธ ๊ตฌ์„ฑ

๋””๋ ‰ํ† ๋ฆฌ ์•„ํ‚คํ…์ฒ˜ ๋ฐฐํฌ ํฌ๋งท ํŒŒ๋ผ๋ฏธํ„ฐ
dnn/ Feed-Forward DNN (PReLU + Dropout) ONNX (INT8 Static / FP32) 13.2 M (PE) / 0.73 M (non-PE)
tabnet/ TabNet (Arik & Pfister, 2021) ONNX FP32 ~3 M
hybrid/ GBDT2NN (DeepGBM, KDD 2019) ONNX (nn_part) + LightGBM booster ~1 M NN
lightgbm/ LightGBM (์‚ฌ์ „ํ•™์Šต, joyce8/EMBER2024-benchmark-models) Treelite .tl โ€”

Subset list

| Subset | Target file type | Input dimension |
|---|---|---|
| PE | All PE binaries (Win32 + Win64 + .NET) | 2,568 |
| Win32 | Windows 32-bit PE | 2,568 |
| Win64 | Windows 64-bit PE | 2,568 |
| .NET | .NET assemblies | 2,568 |
| APK | Android APK | 696 |
| ELF | Linux ELF | 696 |
| PDF | PDF documents | 696 |
| all | All file types combined | 2,568 |
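
Feeding a model a batch of the wrong width is an easy mistake when switching between PE and non-PE subsets. The following is a minimal sketch that looks up the expected width from the table above and validates a batch shape before inference. `INPUT_DIM`, `expected_dim`, and `check_batch_shape` are illustrative names, not part of the repository.

```python
# Expected feature width per subset, copied from the subset table above.
# INPUT_DIM, expected_dim and check_batch_shape are hypothetical helpers.
INPUT_DIM = {
    "PE": 2568, "Win32": 2568, "Win64": 2568, ".NET": 2568, "all": 2568,
    "APK": 696, "ELF": 696, "PDF": 696,
}


def expected_dim(subset: str) -> int:
    """Return the input dimension that the models for `subset` expect."""
    try:
        return INPUT_DIM[subset]
    except KeyError:
        raise ValueError(f"unknown subset: {subset!r}") from None


def check_batch_shape(subset: str, shape: tuple) -> None:
    """Raise ValueError unless `shape` is (N, D) with D matching the subset."""
    dim = expected_dim(subset)
    if len(shape) != 2 or shape[1] != dim:
        raise ValueError(f"{subset}: expected (N, {dim}), got {shape}")
```

For example, `check_batch_shape("APK", X.shape)` can be called just before `sess.run(...)`.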

Directory structure

File naming rule: {model}_{subset}[_suffix].{ext}
The .NET subset is written as dotnet in file names.

dnn/
├── dnn_PE.onnx              # INT8 Static (for deployment; PE/Win32/Win64/dotnet/all)
├── dnn_PE_fp32.onnx         # FP32 ONNX (for reference; shipped only alongside INT8 models)
├── dnn_PE.pt                # PyTorch checkpoint
├── dnn_PE_metrics.json      # Evaluation results (AUC, TPR@1%FPR)
├── dnn_PE_benchmark.json    # Size and latency
├── dnn_APK.onnx             # FP32 (non-PE: INT8 loses too much AUC)
├── dnn_APK.pt
└── ...

tabnet/
├── tabnet_PE.onnx           # FP32 ONNX (134 MB: sparsemax unrolled)
├── tabnet_PE.zip            # pytorch-tabnet native format (7 MB, lightweight)
└── ...

hybrid/
├── hybrid_PE_nnpart.onnx    # GBDT2NN nn_part ONNX (5.1 MB)
├── hybrid_PE_lgbm.model     # LightGBM booster (3.6 MB)
├── hybrid_PE.pt             # PyTorch checkpoint
└── ...

lightgbm/
├── lightgbm_PE.tl           # Treelite serialization (platform independent, must be recompiled)
└── ...
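
The naming rule can be applied programmatically when downloading several subsets. This is a minimal sketch, assuming the directory prefix always equals the model name as in the tree above. `model_filename` is a hypothetical helper, and whether a given combination of subset and suffix actually exists in the repository is not checked here.

```python
def model_filename(model: str, subset: str, ext: str, suffix: str = "") -> str:
    """Build '{model}/{model}_{subset}[_suffix].{ext}' for a repository file.

    Hypothetical helper: it only formats the name and does not verify that
    the file exists in the repository.
    """
    # The .NET subset is spelled "dotnet" in file names.
    subset_token = "dotnet" if subset == ".NET" else subset
    stem = f"{model}_{subset_token}"
    if suffix:
        stem += f"_{suffix}"
    return f"{model}/{stem}.{ext.lstrip('.')}"
```

The result can be passed as the `filename` argument of `hf_hub_download`, for example `model_filename("dnn", "PE", "onnx")`.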

Performance results (EMBER2024 test set)

Evaluation criteria: ROC-AUC, TPR @ 1% FPR (paper §4.1), and challenge-set detection rate at the FPR = 1% threshold
Challenge set: 6,315 evasive malware samples (positives only: Win32 3,225 / .NET 829 / Win64 814 / PDF 805 / ELF 386 / APK 256)
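
To make the two threshold-based metrics concrete, here is a minimal NumPy sketch. It picks the lowest threshold that flags at most 1% of benign test samples, reports the malware TPR at that threshold, and reuses the same threshold on an all-positive set such as the challenge set. The function names and the strict `score > threshold` convention are assumptions; the exact tie handling used for the reported numbers is not stated in this card.

```python
import numpy as np


def threshold_at_fpr(benign_scores, max_fpr=0.01):
    """Lowest threshold that flags at most max_fpr of the benign scores.

    A sample counts as detected when score > threshold (assumed convention).
    """
    benign = np.sort(np.asarray(benign_scores, dtype=np.float64))[::-1]  # descending
    k = int(np.floor(max_fpr * benign.size))  # number of false positives allowed
    # The (k+1)-th largest benign score leaves at most k benign scores above it.
    return float(benign[k]) if k < benign.size else float("-inf")


def tpr_at_fpr(y_true, y_score, max_fpr=0.01):
    """Return (TPR, threshold) at the given FPR budget; y_true is 1 for malware."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=np.float64)
    thr = threshold_at_fpr(y_score[~y_true], max_fpr)
    return float(np.mean(y_score[y_true] > thr)), thr


def detection_rate(positive_scores, threshold):
    """Fraction of an all-positive set (e.g. the challenge set) above threshold."""
    return float(np.mean(np.asarray(positive_scores, dtype=np.float64) > threshold))
```

The threshold is computed once on the test set and then passed unchanged to `detection_rate`, which mirrors how the challenge-set numbers are described.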

DNN

| Subset | ROC-AUC | TPR@1%FPR | Deployment format | Size |
|---|---|---|---|---|
| PE | 0.9969 | 0.9539 | INT8 Static ONNX | 13.3 MB |
| Win32 | 0.9966 | 0.9468 | INT8 Static ONNX | 13.3 MB |
| Win64 | 0.9969 | 0.9656 | INT8 Static ONNX | 13.3 MB |
| .NET | 0.9939 | 0.8976 | INT8 Static ONNX | 13.3 MB |
| all | 0.9942 | 0.9103 | INT8 Static ONNX | 13.3 MB |
| APK | 0.9761 | 0.7682 | FP32 ONNX | 3.9 MB |
| ELF | 0.9840 | 0.8103 | FP32 ONNX | 3.9 MB |
| PDF | 0.9795 | 0.8902 | FP32 ONNX | 3.9 MB |

The non-PE subsets (APK/ELF/PDF) take 696-dimensional input. With so few parameters, INT8 quantization costs too much AUC, so these models are kept in FP32.

TabNet

| Subset | ROC-AUC | TPR@1%FPR | Deployment format | Size |
|---|---|---|---|---|
| PE | 0.9951 | 0.9212 | FP32 ONNX | 134 MB |
| Win32 | 0.9949 | 0.9322 | FP32 ONNX | 134 MB |
| Win64 | 0.9946 | 0.9323 | FP32 ONNX | 134 MB |
| .NET | 0.9925 | 0.8702 | FP32 ONNX | 134 MB |
| all | 0.9922 | 0.8912 | FP32 ONNX | 134 MB |
| APK | 0.9741 | 0.7028 | FP32 ONNX | 13.5 MB |
| ELF | 0.9793 | 0.5460 | FP32 ONNX | 13.5 MB |
| PDF | 0.9810 | 0.8597 | FP32 ONNX | 13.5 MB |

The PE-family ONNX files are 134 MB because the sparsemax attention loop is unrolled into the ONNX graph, a structural property of the export. If file size is the priority, using tabnet_model.zip (7 MB) directly is recommended.

Hybrid (GBDT2NN)

| Subset | ROC-AUC | TPR@1%FPR | Deployment format | Size |
|---|---|---|---|---|
| PE | 0.9982 | 0.9752 | nn_part ONNX + LightGBM booster | 5.1 + 3.6 MB |
| Win32 | 0.9981 | 0.9747 | nn_part ONNX + LightGBM booster | 5.1 + 3.6 MB |
| Win64 | 0.9983 | 0.9812 | nn_part ONNX + LightGBM booster | 5.1 + 3.6 MB |
| .NET | 0.9961 | 0.9466 | nn_part ONNX + LightGBM booster | 5.1 + 3.5 MB |
| all | 0.9970 | 0.9528 | nn_part ONNX + LightGBM booster | 5.1 + 3.6 MB |
| APK | 0.9821 | 0.8003 | nn_part ONNX + LightGBM booster | 5.1 + 3.5 MB |
| ELF | 0.9899 | 0.8827 | nn_part ONNX + LightGBM booster | 5.1 + 3.6 MB |
| PDF | 0.9879 | 0.9283 | nn_part ONNX + LightGBM booster | 5.1 + 3.6 MB |

LightGBM (Treelite-compiled)

| Subset | ROC-AUC | TPR@1%FPR | Size (.tl) | Size (original .model) |
|---|---|---|---|---|
| PE | 0.9983 | 0.9707 | 5.3 MB | 3.6 MB |
| Win32 | 0.9984 | 0.9736 | 5.3 MB | 3.6 MB |
| Win64 | 0.9989 | 0.9831 | 5.3 MB | 3.6 MB |
| .NET | 0.9980 | 0.9566 | 5.3 MB | 3.5 MB |
| all | 0.9968 | 0.9440 | 5.3 MB | 3.6 MB |
| APK | 0.9861 | 0.8157 | 5.3 MB | 3.5 MB |
| ELF | 0.9929 | 0.9140 | 5.3 MB | 3.6 MB |
| PDF | 0.9913 | 0.9275 | 5.3 MB | 3.6 MB |

Original LightGBM models: joyce8/EMBER2024-benchmark-models. Each .tl file is a platform-independent serialization produced with Treelite 3.9.1 and must be recompiled on each target platform.

Challenge Set Detection Rate

Challenge set: 6,315 evasive malware samples (all positive). The threshold that gives FPR = 1% on the test set is applied.

| Subset | DNN | TabNet | Hybrid | LightGBM |
|---|---|---|---|---|
| .NET | 58.6% | 70.0% | 80.6% | 79.6% |
| APK | 27.3% | 29.3% | 34.4% | 33.6% |
| ELF | 11.7% | 4.4% | 23.8% | 30.3% |
| PDF | 41.5% | 40.1% | 56.9% | 57.1% |
| PE | 38.5% | 36.9% | 58.2% | 58.8% |
| Win32 | 36.6% | 45.3% | 58.4% | 69.9% |
| Win64 | 46.3% | 44.1% | 59.5% | 59.7% |
| all | 35.3% | 42.3% | 54.1% | 48.4% |

Inference performance (Apple M1, darwin-arm64)

warm_batch1 latency: batch size 1, measured after cache warm-up. Results may differ in the deployment environment (x86_64 Linux).
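
A warm batch-1 measurement of this kind can be sketched as follows. The warm-up count, the number of timed runs, and the use of the median are assumptions made for illustration. The card does not state how the reported figures were aggregated, and `warm_latency_ms` is a hypothetical helper.

```python
import statistics
import time


def warm_latency_ms(predict, sample, warmup=50, runs=500):
    """Median per-call latency in milliseconds for one sample, after warm-up."""
    for _ in range(warmup):  # untimed calls to warm caches and lazy initialisation
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1e3)  # seconds -> ms
    return statistics.median(timings)
```

With an ONNX Runtime session this could be called as `warm_latency_ms(lambda x: sess.run(["logit"], {"features": x}), X)`, where `X` has shape (1, D).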

๋ ˆ์ดํ„ด์‹œ (ms, warm batch=1)

Subset DNN TabNet Hybrid LightGBM
.NET 0.248 5.465 0.151 0.050
APK 0.035 0.846 0.145 0.031
ELF 0.039 0.505 0.160 0.036
PDF 0.036 2.230 0.172 0.048
PE 0.290 4.402 0.138 0.028
Win32 0.288 4.693 0.141 0.044
Win64 0.220 5.621 0.422 0.039
all 0.254 4.788 0.147 0.068

TabNet ๋ ˆ์ดํ„ด์‹œ ๋†’์Œ: sparsemax attention์ด ONNX ๊ทธ๋ž˜ํ”„๋กœ ์–ธํด๋”ฉ๋˜๋Š” ๊ตฌ์กฐ์  ํŠน์„ฑ.
Hybrid = nn_part ONNX ์ถ”๋ก ๋งŒ ์ธก์ • (LightGBM leaf extraction ์ œ์™ธ).
LightGBM ๋ ˆ์ดํ„ด์‹œ = ์ปดํŒŒ์ผ .dylib ๊ธฐ์ค€; ์—…๋กœ๋“œ ํŒŒ์ผ์€ .tl (์žฌ์ปดํŒŒ์ผ ํ•„์š”).

๋ชจ๋ธ ํŒŒ์ผ ํฌ๊ธฐ (๋ฐฐํฌ ํฌ๋งท)

Subset DNN TabNet .onnx TabNet .zip Hybrid (nn+lgbm) LightGBM .tl
PE ๊ณ„์—ด 13.3 MB (INT8) 140.2 MB 7.4 MB 5.3 + 3.8 MB 5.3 MB
non-PE 3.9 MB (FP32) 13.5 MB 3.2 MB 5.3 + 3.7 MB 5.3 MB

Usage

Install dependencies

pip install "onnxruntime>=1.20" numpy huggingface_hub
# For LightGBM / Hybrid inference
pip install "treelite==3.9.1" "treelite_runtime==3.9.1" "lightgbm>=4.6"
# To use the TabNet checkpoint directly
pip install "pytorch-tabnet>=4.1"

DNN inference (ONNX Runtime)

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# PE subset: INT8 Static
model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="dnn/dnn_PE.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# X: np.ndarray shape (N, 2568), dtype float32
X = np.random.randn(1, 2568).astype(np.float32)
logit = sess.run(["logit"], {"features": X})[0]          # shape (N, 1)
prob  = 1 / (1 + np.exp(-logit.ravel()))                  # sigmoid → [0, 1]
print(f"malware probability: {prob[0]:.4f}")

# APK subset: FP32
model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="dnn/dnn_APK.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
X = np.random.randn(1, 696).astype(np.float32)          # non-PE: dim=696
prob = 1 / (1 + np.exp(-sess.run(["logit"], {"features": X})[0].ravel()))
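
The expression `1 / (1 + np.exp(-logit))` returns the right value but can emit an overflow warning when a float32 logit is strongly negative. One possible alternative, shown here as a sketch and not as the repository's own post-processing, is a sigmoid that only ever exponentiates non-positive numbers. `stable_sigmoid` is a hypothetical name.

```python
import numpy as np


def stable_sigmoid(logit):
    """Sigmoid that only exponentiates non-positive values, so exp cannot overflow."""
    z = np.asarray(logit, dtype=np.float64)
    e = np.exp(-np.abs(z))  # always in [0, 1]
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
```

It is a drop-in replacement for the inline expression, for example `prob = stable_sigmoid(logit.ravel())`.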

TabNet inference (ONNX Runtime)

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="tabnet/tabnet_PE.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
X = np.random.randn(1, 2568).astype(np.float32)
# Output: logit (before sigmoid)
logit = sess.run(["logit"], {"features": X})[0]
prob  = 1 / (1 + np.exp(-logit.ravel()))

Hybrid inference (ONNX + LightGBM)

import numpy as np
import lightgbm as lgb
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# 1. Extract leaf indices with the LightGBM booster
booster = lgb.Booster(model_file=hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="hybrid/hybrid_PE_lgbm.model",
))
X_raw = np.random.randn(1, 2568).astype(np.float64)
leaf_indices = booster.predict(X_raw, pred_leaf=True).astype(np.int64)  # (N, n_trees)

# 2. Final classification with the GBDT2NN ONNX model
nn_sess = ort.InferenceSession(hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="hybrid/hybrid_PE_nnpart.onnx",
), providers=["CPUExecutionProvider"])
logit = nn_sess.run(["logit"], {"leaf_indices": leaf_indices})[0]
prob  = 1 / (1 + np.exp(-logit.ravel()))
print(f"malware probability: {prob[0]:.4f}")
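
Because the hybrid model always needs both stages, it can be convenient to wrap them in one call. The sketch below assumes only what the example above shows: a booster whose `predict(..., pred_leaf=True)` returns an (N, n_trees) array, and a session with input `leaf_indices` and output `logit`. `hybrid_predict_proba` is a hypothetical helper, not a function shipped with the repository.

```python
import numpy as np


def hybrid_predict_proba(booster, nn_sess, X):
    """Run leaf extraction, then the nn_part model; return probabilities, shape (N,)."""
    X = np.atleast_2d(np.asarray(X, dtype=np.float64))
    # Stage 1: one leaf index per tree for every row.
    leaf_indices = np.asarray(booster.predict(X, pred_leaf=True), dtype=np.int64)
    # Stage 2: the NN part maps leaf indices to a single logit per row.
    logit = nn_sess.run(["logit"], {"leaf_indices": leaf_indices})[0]
    return 1.0 / (1.0 + np.exp(-logit.ravel().astype(np.float64)))
```

With the objects from the example above, `hybrid_predict_proba(booster, nn_sess, X_raw)` yields the same probabilities as the two explicit steps.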

LightGBM inference (Treelite-compiled, fast inference)

# 1. Compile the Treelite .tl into a platform-specific shared library (first run only)
import os
import sys

import numpy as np
import treelite
import treelite_runtime
from huggingface_hub import hf_hub_download

tl_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="lightgbm/lightgbm_PE.tl",
)
tl_model = treelite.Model.deserialize(tl_path)
lib_ext  = ".dylib" if sys.platform == "darwin" else ".so"
lib_path = os.path.splitext(tl_path)[0] + lib_ext  # swap only the file extension
tl_model.export_lib(
    toolchain="clang" if sys.platform == "darwin" else "gcc",
    libpath=lib_path,
    verbose=False,
)

# 2. Inference
predictor = treelite_runtime.Predictor(lib_path, verbose=False)
X = np.random.randn(1, 2568).astype(np.float32)
prob = predictor.predict(treelite_runtime.DMatrix(X))
print(f"malware probability: {prob[0]:.4f}")

Note: treelite==3.9.1 and treelite_runtime==3.9.1 are required. Treelite 4.x does not support export_lib().


Training and evaluation environment

| Item | Details |
|---|---|
| Dataset | EMBER2024: train 52 weeks (2.6 M), test 12 weeks (606 K), challenge 6,315 |
| Feature dimension | PE 2,568 (v3) / non-PE 696 (valid prefix) |
| Split policy | Fixed chronological order (temporal split), no random shuffling |
| Training environment | DGX Spark (GB10 Grace Blackwell, 128 GB, CUDA 13) |
| Frameworks | PyTorch 2.11.0, pytorch-tabnet 4.1, LightGBM 4.6 |
| Reproducibility seed | 42 |
| DNN architecture | Linear(2568→2568→1024→512→1) + PReLU + Dropout(0.5) |
| Hybrid | LightGBM leaf extraction → Linear(n_trees→512→256→1) + PReLU |
| Evaluation metrics | ROC-AUC, PR-AUC, TPR @ 1% FPR (paper §4.1) |
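
The split policy in the table can be illustrated with a small sketch. It assumes each sample carries a 0-based collection-week number and that the 12 test weeks directly follow the 52 training weeks. Both are assumptions about the data layout, and `temporal_split` is a hypothetical helper.

```python
def temporal_split(week_index, train_weeks=52, test_weeks=12):
    """Return (train_idx, test_idx) for samples labelled with a 0-based week number.

    Weeks [0, train_weeks) go to train and the following `test_weeks` weeks go
    to test. Indices keep their input order; nothing is shuffled.
    """
    train_idx, test_idx = [], []
    for i, week in enumerate(week_index):
        if 0 <= week < train_weeks:
            train_idx.append(i)
        elif train_weeks <= week < train_weeks + test_weeks:
            test_idx.append(i)
        # Samples outside both windows are left out of either split.
    return train_idx, test_idx
```

Because the test weeks come strictly after the training weeks, no test sample is older than any training sample, which is the property a temporal split is meant to guarantee.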

Known limitations

  • TabNet ONNX size: unrolling the sparsemax attention loop inflates the PE-family ONNX files to 134 MB. The original tabnet_model.zip (7 MB) is the lightweight option.
  • Treelite .dylib: a precompiled file for Mac ARM64 only. Other platforms must recompile from the .tl file.
  • DNN non-PE INT8: the 696-dim models lose too much AUC under quantization, so they are kept in FP32.
  • Hybrid inference: not a single ONNX file. It takes two steps, LightGBM leaf extraction followed by the nn_part ONNX model.
  • Challenge detection rate: measured with the threshold that gives FPR = 1% on the test set. Values can vary because the distributions differ between subsets.

Citation

@inproceedings{joyce2025ember2024,
  title     = {EMBER2024 - A Benchmark Dataset for Holistic Evaluation of Malware Classifiers},
  author    = {Joyce, Robert J. and Miller, Gideon and Roth, Phil and Zak, Richard and Zaresky-Williams, Elliott and Anderson, Hyrum and Raff, Edward and Holt, James},
  booktitle = {Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
  year      = {2025},
  url       = {https://arxiv.org/abs/2506.05074}
}

๋ผ์ด์„ ์Šค

์ฝ”๋“œ ๋ฐ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜: Apache 2.0
LightGBM ์›๋ณธ ๋ชจ๋ธ(hybrid/hybrid_*_lgbm.model): joyce8/EMBER2024-benchmark-models ๋ผ์ด์„ ์Šค ์ค€์ˆ˜
