	---
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- sentence-similarity
	- feature-extraction
	datasets:
	- CyCraftAI/CyPHER
	extra_gated_fields:
	  First Name: text
	  Last Name: text
	  Date of birth: date_picker
	  Country: country
	  Affiliation: text
	  Job title:
	    type: select
	    options:
	    - Student
	    - Research Graduate
	    - AI researcher
	    - AI developer/engineer
	    - Reporter
	    - Other
	  geo: ip_location
	---

	# CmdCaliper-base
	## [[Dataset](https://huggingface.co/datasets/CyCraftAI/CyPHER)] [[Code](https://github.com/cycraft-corp/CmdCaliper)] [[Paper](https://arxiv.org/abs/2411.01176)]

	CmdCaliper, developed by CyCraft AI Lab, is the first family of embedding models designed specifically for command lines. Our evaluation results show that even the smallest model, CmdCaliper-small, with approximately 30 million parameters, can outperform state-of-the-art sentence embedding models with more than 10 times as many parameters (335 million) on various command-line-specific tasks.

	CmdCaliper offers three models of different sizes: CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small. This provides flexible options to accommodate various hardware resource constraints.

	CmdCaliper was introduced in the EMNLP 2024 paper titled "CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research".

	## Metric
	| Methods              | Model Parameters | MRR @3 | MRR @10 | Top @3 | Top @10 |
	|----------------------|------------------|--------|---------|--------|---------|
	| Levenshtein distance | -                | 71.23  | 72.45   | 74.99  | 81.83   |
	| Word2Vec             | -                | 45.83  | 46.93   | 48.49  | 54.86   |
	|                      |                  |        |         |        |         |
	| E5-small             | Small (0.03B)    | 81.59  | 82.60   | 84.97  | 90.59   |
	| GTE-small            | Small (0.03B)    | 82.35  | 83.28   | 85.39  | 90.84   |
	| CmdCaliper-small     | Small (0.03B)    | 86.81  | 87.78   | 89.21  | 94.76   |
	|                      |                  |        |         |        |         |
	| BGE-en-base          | Base (0.11B)     | 79.49  | 80.41   | 82.33  | 87.39   |
	| E5-base              | Base (0.11B)     | 83.16  | 84.07   | 86.14  | 91.56   |
	| GTR-base             | Base (0.11B)     | 81.55  | 82.51   | 84.54  | 90.10   |
	| GTE-base             | Base (0.11B)     | 78.20  | 79.07   | 81.22  | 86.14   |
	| CmdCaliper-base      | Base (0.11B)     | 87.56  | 88.47   | 90.27  | 95.26   |
	|                      |                  |        |         |        |         |
	| BGE-en-large         | Large (0.34B)    | 84.11  | 84.92   | 86.64  | 91.09   |
	| E5-large             | Large (0.34B)    | 84.12  | 85.04   | 87.32  | 92.59   |
	| GTR-large            | Large (0.34B)    | 88.09  | 88.68   | 91.27  | 94.58   |
	| GTE-large            | Large (0.34B)    | 84.26  | 85.03   | 87.14  | 91.41   |
	| CmdCaliper-large     | Large (0.34B)    | 89.12  | 89.91   | 91.45  | 95.65   |
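
For readers unfamiliar with the column headers, the sketch below shows how MRR@k and Top@k are commonly computed from the rank of the first correct match for each query. It assumes the standard definitions of both metrics; the exact evaluation protocol is described in the paper and may differ in detail. The function names and the example ranks are hypothetical and are not taken from the CmdCaliper code base.

```python
from typing import Sequence


def mrr_at_k(ranks: Sequence[int], k: int) -> float:
    """Mean reciprocal rank, counting a query only if its hit is within the top k.

    `ranks` holds the 1-based rank of the first correct candidate per query.
    """
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)


def top_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose first correct candidate is within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks for four queries (illustrative only, not from the paper)
ranks = [1, 2, 5, 12]

mrr3 = mrr_at_k(ranks, 3)    # (1 + 1/2 + 0 + 0) / 4
top10 = top_at_k(ranks, 10)  # 3 of the 4 queries have a hit in the top 10
print(mrr3, top10)
```

The values in the table appear to be these quantities expressed as percentages.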

	## Usage
	### HuggingFace Transformers
	```python
	import torch
	import torch.nn.functional as F
	from torch import Tensor
	from transformers import AutoTokenizer, AutoModel


	def average_pool(last_hidden_states: Tensor,
	                 attention_mask: Tensor) -> Tensor:
	    # Zero out padding positions, then average over the real tokens only
	    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
	    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


	# Raw strings keep the Windows backslashes literal
	input_texts = [
	    'cronjob schedule daily 00:00 ./program.exe',
	    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
	    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
	]

	tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
	model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

	# Tokenize the input texts
	batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

	# Inference only, so skip gradient tracking
	with torch.no_grad():
	    outputs = model(**batch_dict)
	embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

	# (Optionally) normalize embeddings
	embeddings = F.normalize(embeddings, p=2, dim=1)
	# Similarity of the first command to the other two, scaled by 100
	scores = (embeddings[:1] @ embeddings[1:].T) * 100
	print(scores.tolist())
	```
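
To make the pooling step concrete, here is a minimal pure-Python sketch of what `average_pool` computes for a single sequence: the mean of the token vectors whose attention mask is 1, with padding positions ignored. The helper name `masked_mean` and the toy vectors are hypothetical and only serve as an illustration; the real code operates on batched tensors.

```python
from typing import List


def masked_mean(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding has mask 0)."""
    kept = sum(mask)  # number of real (non-padding) tokens
    dim = len(token_vectors[0])
    return [
        sum(vec[d] for vec, m in zip(token_vectors, mask) if m) / kept
        for d in range(dim)
    ]


# Two real tokens followed by one padding token (toy 2-d vectors)
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # the padding vector [9.0, 9.0] does not affect the result
```

Dividing by the number of real tokens rather than the padded length keeps embeddings comparable across commands of different lengths in the same batch.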

	### Sentence Transformers
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("CyCraftAI/CmdCaliper-base")
	# Run inference (raw strings keep the Windows backslashes literal)
	sentences = [
	    'cronjob schedule daily 00:00 ./program.exe',
	    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
	    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
	]
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 768)

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities.shape)
	# torch.Size([3, 3])
	```
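
A similarity matrix like the one above is typically used to rank candidate commands against a query. The sketch below illustrates that ranking step in pure Python with cosine similarity, which, to our understanding, is what `model.similarity` uses unless the model is configured otherwise. The function names and the toy 3-dimensional vectors are hypothetical stand-ins for real 768-dimensional embeddings.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: List[float],
                       candidates: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real command-line embeddings (illustrative only)
query = [1.0, 0.0, 0.0]
candidates = {
    "schtasks": [0.9, 0.1, 0.0],
    "xcopy": [0.0, 1.0, 0.0],
}

ranked = rank_by_similarity(query, candidates)
print(ranked[0][0])  # name of the closest candidate
```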

	## Limitation
	This model focuses exclusively on Windows command lines. In addition, inputs longer than 512 tokens are truncated to the first 512 tokens.
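
Because truncation happens silently, it can be useful to flag over-long commands before encoding them. The following is a minimal sketch under stated assumptions: `find_truncated` and `whitespace_count` are hypothetical helpers, and the whitespace counter is only a stand-in so the example runs without downloading the model. With the real tokenizer, a counter such as `lambda c: len(tokenizer(c)["input_ids"])` would give the actual token count.

```python
from typing import Callable, Iterable, List

MAX_TOKENS = 512  # limit stated in this model card


def find_truncated(commands: Iterable[str],
                   count_tokens: Callable[[str], int],
                   max_tokens: int = MAX_TOKENS) -> List[str]:
    """Return the commands whose token count exceeds the model's limit."""
    return [c for c in commands if count_tokens(c) > max_tokens]


def whitespace_count(command: str) -> int:
    """Stand-in token counter (assumption): splits on whitespace only."""
    return len(command.split())


commands = [
    "dir C:\\Users",
    "echo " + "A " * 600,  # artificially long command
]

too_long = find_truncated(commands, whitespace_count)
print(len(too_long))  # number of commands that would be truncated
```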

	## Citation
	```
	@inproceedings{huang2024cmdcaliper,
	  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
	  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
	  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
	  year={2024}
	}
	```