	---
	library_name: transformers
	tags:
	- cybersecurity
	- mpnet
	- classification
	- fine-tuned
	---

	# Model Card for MPNet Cybersecurity Classifier

	This is a fine-tuned MPNet model specialized for classifying cybersecurity threat groups based on textual descriptions of their tactics and techniques.

	## Model Details

	### Model Description

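	The classifier takes an English-language description of tactics, techniques, and procedures (TTPs) and predicts the threat group the description most likely belongs to. It was fine-tuned from an MPNet checkpoint that had first been adapted to cybersecurity text through masked language modeling (MLM).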

	- Developed by: Dženan Hamzić
	- Model type: Transformer-based classification model (MPNet)
	- Language(s) (NLP): English
	- License: Apache-2.0
	- Finetuned from model: microsoft/mpnet-base (with intermediate MLM fine-tuning)

	### Model Sources

	- Base Model: [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base)

	## Uses

	### Direct Use

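	Given a textual description of attacker behavior, the model predicts which known cybersecurity threat group it matches. The prediction is a numeric label that maps to a GroupID, for example `G0081`.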

	### Downstream Use

	Integration into Cyber Threat Intelligence platforms, SOC incident analysis tools, and automated threat detection systems.

	### Out-of-Scope Use

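	- General-purpose language tasks and any text outside the cybersecurity domain
	- Identifying threat groups that were not present in the training data (the model can only predict labels it was trained on)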

	## Bias, Risks, and Limitations

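	The model was trained only on cybersecurity text, so predictions for unrelated input are unreliable. Even in-domain, the reported test accuracy is 0.7161, which means roughly 28% of test descriptions were assigned to the wrong group.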

	### Recommendations

	Always have cybersecurity analysts verify predictions before using them in critical decision-making.

	## How to Get Started with the Model

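	```python
	import torch
	from transformers import AutoTokenizer, MPNetModel

	model_name = "mpnet_classification_finetuned_v2"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = MPNetModel.from_pretrained(model_name)

	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	model.to(device)
	model.eval()  # disable dropout for inference

	# NOTE: `classifier_model` is not defined in this snippet. It is the fine-tuned
	# classification wrapper whose `.classifier` head (768 -> 512 -> num_labels)
	# must be loaded separately and moved to `device`.

	# Example inference
	sentence = "APT38 has used phishing emails with malicious links to distribute malware."
	inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding="max_length", max_length=128).to(device)

	with torch.no_grad():
	    outputs = model(**inputs)
	    # Embedding of the first (CLS) token, used as the sentence representation
	    cls_embedding = outputs.last_hidden_state[:, 0, :]
	    predicted_class = classifier_model.classifier(cls_embedding).argmax(dim=1).cpu().item()

	# This is a numeric label index; see the Single Prediction Example for mapping it to a GroupID
	print(f"Predicted GroupID: {predicted_class}")
	```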

	## Training Details

	### Training Data

	The training dataset comprises balanced textual descriptions of various cybersecurity threat groups' TTPs, augmented through synonym replacement to increase diversity.
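
To illustrate the idea, here is a minimal sketch of synonym-replacement augmentation. The synonym table, the `augment` function, and the replacement probability `p` are hypothetical; the tooling and vocabulary actually used to build the training set are not documented here.

```python
import random

# Hypothetical synonym table; the vocabulary used for the real dataset is not documented.
SYNONYMS = {
    "used": ["leveraged", "employed"],
    "malicious": ["harmful", "weaponized"],
    "distribute": ["deliver", "spread"],
}

def augment(sentence, p=0.5, seed=None):
    """Swap each word found in SYNONYMS for a random synonym with probability p."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        choices = SYNONYMS.get(word.lower())
        if choices and rng.random() < p:
            out.append(rng.choice(choices))
        else:
            out.append(word)  # words without a listed synonym are kept as-is
    return " ".join(out)

original = "APT38 has used phishing emails with malicious links to distribute malware."
augmented = augment(original, p=1.0, seed=0)
print(augmented)
```

Each augmented sentence keeps the label of the sentence it was derived from, so this kind of augmentation adds lexical variety without changing the class balance.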

	### Training Procedure

	- Fine-tuned from: MLM fine-tuned MPNet ("mpnet_mlm_cyber_finetuned-v2")
	- Epochs: 20
	- Learning rate: 5e-6
	- Batch size: 16

	## Evaluation

	### Testing Data, Factors & Metrics

	- Testing Data: Stratified sample from original dataset.
	- Metrics: Accuracy, Weighted F1 Score
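
For reference, the two metrics can be computed as sketched below in plain Python. The function names and the toy labels are illustrative only; this is not the evaluation code used for this model. Weighted F1 averages the per-class F1 scores, weighting each class by its number of true examples.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n
    return total / len(y_true)

# Toy labels for illustration, not real model output
y_true = ["G0081", "G0081", "G0082", "G0082"]
y_pred = ["G0081", "G0082", "G0082", "G0082"]

print(accuracy(y_true, y_pred))
print(weighted_f1(y_true, y_pred))
```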

	### Results

	| Metric                         | Value                    |
	|--------------------------------|--------------------------|
	| Classification Accuracy (Test) | 0.7161                   |
	| Weighted F1 Score              | [More Information Needed] |

	### Single Prediction Example

	```python
	# Assumes `train_df`, `tokenizer`, `classifier_model`, and `device` already exist
	# from the training session; they are not defined in this snippet.
	import torch

	# Create explicit mapping from numeric labels to original GroupIDs
	label_to_groupid = dict(enumerate(train_df["GroupID"].astype("category").cat.categories))

	def predict_group(sentence):
	    """Return the predicted GroupID for a single sentence."""
	    classifier_model.eval()
	    encoding = tokenizer(
	        sentence,
	        truncation=True,
	        padding="max_length",
	        max_length=128,
	        return_tensors="pt"
	    )
	    input_ids = encoding["input_ids"].to(device)
	    attention_mask = encoding["attention_mask"].to(device)

	    with torch.no_grad():
	        logits = classifier_model(input_ids, attention_mask)
	        predicted_label = torch.argmax(logits, dim=1).cpu().item()

	    # Explicitly convert numeric label to original GroupID
	    predicted_groupid = label_to_groupid[predicted_label]
	    return predicted_groupid

	sentence = "APT38 has used phishing emails with malicious links to distribute malware."
	predicted_class = predict_group(sentence)
	print(f"Predicted GroupID: {predicted_class}")  # e.g., Predicted GroupID: G0081
	```

	## Environmental Impact

	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

	- Hardware Type: [To be filled by user]
	- Hours used: [To be filled by user]
	- Cloud Provider: [To be filled by user]
	- Compute Region: [To be filled by user]
	- Carbon Emitted: [To be filled by user]

	## Technical Specifications

	### Model Architecture

	- MPNet architecture with classification head (768 -> 512 -> num_labels)
	- Last 10 transformer layers fine-tuned explicitly
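
The following is a minimal PyTorch sketch of a head with these dimensions, together with one way to train only the last N encoder layers. It is an illustration under assumptions, not the author's implementation: the ReLU activation, the dropout rate, the class and function names, and the stand-in encoder layers are all hypothetical, and this card does not state whether the remaining layers were frozen.

```python
import torch
from torch import nn

class ClassificationHead(nn.Module):
    """768 -> 512 -> num_labels head; the activation and dropout are assumed."""

    def __init__(self, num_labels, hidden_size=768, inner_size=512, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, inner_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(inner_size, num_labels),
        )

    def forward(self, cls_embedding):
        return self.net(cls_embedding)

def unfreeze_last_n_layers(layers, n):
    """Freeze every layer in `layers`, then re-enable gradients for the last n."""
    layers = list(layers)
    for layer in layers:
        for param in layer.parameters():
            param.requires_grad = False
    for layer in layers[len(layers) - n:]:
        for param in layer.parameters():
            param.requires_grad = True

# Stand-in for a 12-layer encoder (the layer count is an assumption here)
layers = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
unfreeze_last_n_layers(layers, n=10)

head = ClassificationHead(num_labels=5)  # 5 is an arbitrary example value
logits = head(torch.randn(2, 768))       # batch of 2 fake CLS embeddings
print(logits.shape)
```

With a Hugging Face `MPNetModel`, the encoder layers are typically reachable as `model.encoder.layer`, which could be passed to `unfreeze_last_n_layers` in place of the stand-in list.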

	## Model Card Authors

	- Dženan Hamzić

	## Model Card Contact

	- [More Information Needed]