---
license: cc-by-nc-4.0
language:
- en
tags:
- cybersecurity
widget:
- text: >-
    Native API functions such as <mask> may be directly invoked via system calls
    (syscalls). However, these features are also commonly exposed to user-mode
    applications through interfaces and libraries.
  example_title: Native API functions
- text: >-
    One way to explicitly assign the PPID of a new process is through the <mask>
    API call, which includes a parameter for defining the PPID.
  example_title: Assigning the PPID of a new process
- text: >-
    Enable Safe DLL Search Mode to ensure that system DLLs in more restricted
    directories (e.g., %<mask>%) are prioritized over DLLs in less secure
    locations such as a user’s home directory.
  example_title: Enable Safe DLL Search Mode
- text: >-
    GuLoader is a file downloader that has been active since at least December
    2019. It has been used to distribute a variety of <mask>, including NETWIRE,
    Agent Tesla, NanoCore, and FormBook.
  example_title: GuLoader is a file downloader
new_version: cisco-ai/SecureBERT2.0-base
base_model:
- ehsanaghaei/SecureBERT
---

# SecureBERT+

**SecureBERT+** is an enhanced version of [SecureBERT](https://huggingface.co/ehsanaghaei/SecureBERT), trained on a corpus **five times larger** than its predecessor and leveraging the computational power of **8×A100 GPUs**.

This model delivers an **average 6% improvement** in Masked Language Modeling (MLM) performance compared to SecureBERT, representing a significant advancement in language understanding and representation within the cybersecurity domain.

---

## Dataset

SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.
| |  |
| |
|

---

## Using SecureBERT+

SecureBERT+ is available on the [Hugging Face Hub](https://huggingface.co/ehsanaghaei/SecureBERT_Plus).

### Load the Model

```python
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

# Encode a sample sentence and extract its contextual embeddings.
inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state
```
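
For quick experimentation, the masked-word examples in the widget above can also be reproduced with the standard `fill-mask` pipeline from `transformers`. This is a minimal sketch; the exact candidates and scores depend on the downloaded weights.

```python
from transformers import pipeline

# Load SecureBERT+ behind the fill-mask pipeline (RoBERTa-style MLM head).
fill_mask = pipeline("fill-mask", model="ehsanaghaei/SecureBERT_Plus")

# <mask> is RoBERTa's mask token; the pipeline returns the top-k candidates.
predictions = fill_mask(
    "GuLoader is a file downloader that has been used to distribute a "
    "variety of <mask>, including NETWIRE and Agent Tesla.",
    top_k=5,
)
for pred in predictions:
    print(f"{pred['token_str'].strip()!r}  score={pred['score']:.3f}")
```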

## Masked Language Modeling Example

Use the code below to predict masked words in text:
```python
# !pip install transformers torch tokenizers

import torch
import transformers
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Return the top-k candidate tokens for every <mask> in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of all mask tokens in the encoded sequence.
    masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    words = []

    with torch.no_grad():
        output = model(token_ids)

    for pos in masked_pos:
        logits = output.logits[0, pos]
        top_tokens = torch.topk(logits, k=topk).indices.tolist()
        # Decode each candidate token individually.
        predictions = [tokenizer.decode([i]).strip() for i in top_tokens]
        words.append(predictions)
        if print_results:
            print(f"Mask predictions: {predictions}")

    return words
```
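
For example, calling the helper on one of the widget sentences above (a usage sketch; the ranked candidates depend on the downloaded weights):

```python
predict_mask(
    "One way to explicitly assign the PPID of a new process is through the "
    "<mask> API call, which includes a parameter for defining the PPID.",
    tokenizer,
    model,
    topk=5,
)
```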

## Limitations & Risks

- **Domain-specific scope:** SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.
- **Bias in training data:** The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.
- **Potential misuse:** While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.
- **Resource-intensive training:** The larger dataset and training process require significant compute resources, which may limit reproducibility for smaller research teams.
- **Evolving threats:** The cybersecurity landscape evolves rapidly; without regular retraining, the model may not capture emerging threats or terminology.

Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.

## Reference

```bibtex
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}
```