---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
datasets:
- CyCraftAI/CyPHER
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  Job title:
    type: select
    options:
      - Student
      - Research Graduate
      - AI researcher
      - AI developer/engineer
      - Reporter
      - Other
  geo: ip_location
---

# CmdCaliper-base
## [[Dataset](https://huggingface.co/datasets/CyCraftAI/CyPHER)] [[Code](https://github.com/cycraft-corp/CmdCaliper)] [[Paper](https://arxiv.org/abs/2411.01176)]

CmdCaliper, developed by CyCraft AI Lab, is the first family of embedding models designed specifically for command lines. In our evaluation, even the smallest CmdCaliper model, with approximately 30 million parameters, can outperform state-of-the-art sentence embedding models with more than 10 times as many parameters (335 million) on a range of command-line-specific tasks.

CmdCaliper comes in three sizes, CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small, so you can choose the one that fits your hardware constraints.

CmdCaliper was introduced in the EMNLP 2024 paper titled "CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research".

## Metric
| Methods              | Model Parameters | MRR @3    | MRR @10   | Top @3    | Top @10   |
|----------------------|------------------|-----------|-----------|-----------|-----------|
| Levenshtein distance | -                | 71.23     | 72.45     | 74.99     | 81.83     |
| Word2Vec             | -                | 45.83     | 46.93     | 48.49     | 54.86     |
|                      |                  |           |           |           |           |
| E5-small             | Small (0.03B)    | 81.59     | 82.6      | 84.97     | 90.59     |
| GTE-small            | Small (0.03B)    | 82.35     | 83.28     | 85.39     | 90.84     |
| CmdCaliper-small     | Small (0.03B)    | **86.81** | **87.78** | **89.21** | **94.76** |
|                      |                  |           |           |           |           |
| BGE-en-base          | Base (0.11B)     | 79.49     | 80.41     | 82.33     | 87.39     |
| E5-base              | Base (0.11B)     | 83.16     | 84.07     | 86.14     | 91.56     |
| GTR-base             | Base (0.11B)     | 81.55     | 82.51     | 84.54     | 90.1      |
| GTE-base             | Base (0.11B)     | 78.2      | 79.07     | 81.22     | 86.14     |
| CmdCaliper-base      | Base (0.11B)     | **87.56** | **88.47** | **90.27** | **95.26** |
|                      |                  |           |           |           |           |
| BGE-en-large         | Large (0.34B)    | 84.11     | 84.92     | 86.64     | 91.09     |
| E5-large             | Large (0.34B)    | 84.12     | 85.04     | 87.32     | 92.59     |
| GTR-large            | Large (0.34B)    | 88.09     | 88.68     | 91.27     | 94.58     |
| GTE-large            | Large (0.34B)    | 84.26     | 85.03     | 87.14     | 91.41     |
| CmdCaliper-large     | Large (0.34B)    | **89.12** | **89.91** | **91.45** | **95.65** |

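For readers unfamiliar with the column names, the sketch below shows how MRR @k and Top @k are commonly computed for a retrieval task. It assumes the standard definitions and a single positive per query; the exact evaluation protocol is described in the paper and may differ in detail. The function names and toy data are illustrative and are not part of the CmdCaliper codebase.

```python
from typing import Sequence


def mrr_at_k(rankings: Sequence[Sequence[str]], positives: Sequence[str], k: int) -> float:
    """Mean reciprocal rank, counting a hit only if it appears in the top k."""
    total = 0.0
    for ranking, positive in zip(rankings, positives):
        for rank, candidate in enumerate(ranking[:k], start=1):
            if candidate == positive:
                total += 1.0 / rank
                break
    return total / len(positives)


def top_at_k(rankings: Sequence[Sequence[str]], positives: Sequence[str], k: int) -> float:
    """Fraction of queries whose positive appears anywhere in the top k."""
    hits = sum(positive in ranking[:k] for ranking, positive in zip(rankings, positives))
    return hits / len(positives)


# Toy data (made up): one ranked candidate list per query, best match first.
rankings = [
    ["a", "b", "c"],  # positive "a" is at rank 1
    ["x", "y", "z"],  # positive "z" is at rank 3
    ["p", "q", "r"],  # positive "s" is not retrieved
]
positives = ["a", "z", "s"]

# Scores are multiplied by 100 to match the percentage scale used in the table.
print(round(100 * mrr_at_k(rankings, positives, k=3), 2))  # 44.44
print(round(100 * top_at_k(rankings, positives, k=3), 2))  # 66.67
```

MRR rewards placing the positive near the top of the list, while Top @k only checks whether it appears within the first k results, which is why Top @k is always at least as high as MRR @k.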
## Usage
### HuggingFace Transformers
```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Raw strings keep the Windows backslashes literal
input_texts = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
]

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Cosine similarity (x100) between the first command and the other two
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```

### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CyCraftAI/CmdCaliper-base")
# Run inference (raw strings keep the Windows backslashes literal)
sentences = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
```

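Once you have a pairwise similarity matrix, a common next step is to look up the closest other command for each input. The snippet below is a minimal sketch of that lookup in plain Python. The helper `nearest_neighbors` and the scores in `toy` are made up for illustration; with the example above you could pass `similarities.tolist()` instead.

```python
from typing import List, Tuple


def nearest_neighbors(similarities: List[List[float]]) -> List[Tuple[int, float]]:
    """For each row, return (index, score) of the most similar *other* item.

    Assumes a square matrix with at least two rows.
    """
    results = []
    for i, row in enumerate(similarities):
        # Skip the diagonal so a command is never matched with itself.
        best_j = max((j for j in range(len(row)) if j != i), key=lambda j: row[j])
        results.append((best_j, row[best_j]))
    return results


# Made-up scores standing in for `similarities.tolist()`.
toy = [
    [1.00, 0.82, 0.31],
    [0.82, 1.00, 0.35],
    [0.31, 0.35, 1.00],
]

for i, (j, score) in enumerate(nearest_neighbors(toy)):
    print(f"command {i} is closest to command {j} (score {score:.2f})")
```

For large collections a vector index would be a better fit than a full pairwise matrix; this loop is only meant to show how to read the matrix.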
## Limitation
This model focuses exclusively on Windows command lines. In addition, inputs longer than 512 tokens are truncated.

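Because truncation happens silently, it can be useful to flag over-length inputs before encoding them. The sketch below shows one way to do that. `find_truncated` and `whitespace_count` are hypothetical helpers written for this example, and the whitespace counter is only a stand-in: a real subword tokenizer will usually report more tokens. With the tokenizer from the Usage section, a counter such as `lambda s: len(tokenizer(s)["input_ids"])` would be a closer estimate.

```python
from typing import Callable, List, Sequence


def find_truncated(commands: Sequence[str],
                   count_tokens: Callable[[str], int],
                   max_tokens: int = 512) -> List[int]:
    """Return the indices of commands whose token count exceeds max_tokens."""
    return [i for i, cmd in enumerate(commands) if count_tokens(cmd) > max_tokens]


def whitespace_count(text: str) -> int:
    """Illustrative stand-in for a tokenizer: counts whitespace-separated pieces."""
    return len(text.split())


commands = [
    r"dir /s C:\Users",      # short command, well under the limit
    "echo " + "A " * 600,    # artificially long command (601 pieces)
]

print(find_truncated(commands, whitespace_count))  # [1]
```

Commands flagged this way could be split, shortened, or reviewed manually, depending on which part of the command line matters for your task.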
## Citation
```
@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}
```