---
language:
- code
license: mit
library_name: model2vec
tags:
- model2vec
- embeddings
- code
- retrieval
- static-embeddings
---

# potion-code-16M Model Card

## Overview

**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.

It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.
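
Conceptually, encoding with a static model is just a token-embedding lookup followed by mean pooling; there is no attention and no per-token context. A toy illustration (not the library's actual internals):

```python
import numpy as np

vocab = {"def": 0, "read_file": 1, "(": 2, ")": 3}  # toy vocabulary
table = np.random.randn(len(vocab), 256)            # one 256-d vector per token

def embed(tokens: list[str]) -> np.ndarray:
    ids = [vocab[t] for t in tokens]
    return table[ids].mean(axis=0)                  # mean-pool the token vectors

vec = embed(["def", "read_file", "(", ")"])         # a single 256-d embedding
```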

## Installation

```bash
pip install model2vec
```

## Usage

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("Pringled/potion-code-16M")

# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])

# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n    with open(path) as f:\n        return f.read()"])
```
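
For retrieval, rank code documents by cosine similarity to the query embedding. A minimal self-contained sketch using numpy (illustrative; any vector index works the same way):

```python
import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("Pringled/potion-code-16M")

query_embeddings = model.encode(["How to read a file in Python?"])
code_embeddings = model.encode([
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
    "def add(a, b):\n    return a + b",
])

# Normalize, then cosine similarity reduces to a dot product.
q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
d = code_embeddings / np.linalg.norm(code_embeddings, axis=1, keepdims=True)
scores = q @ d.T

best = scores.argmax(axis=1)  # index of the best-matching document per query
```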

## How it works

potion-code-16M is created using the following pipeline:

1. **Vocabulary mining**: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
2. **Distillation**: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening); see the sketch after this list
3. **Tokenlearn**: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using cosine similarity loss
4. **Contrastive fine-tuning**: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs
5. **Post-SIF re-regularization**: token weights are re-regularized using SIF weighting after each training stage; a numpy sketch of the weighting follows below
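
A rough sketch of steps 1 and 2 with Model2Vec's `distill` API. The token list below is a tiny hypothetical stand-in for the ~42k tokens actually mined, and the exact distillation flags used for this model are not documented here (CodeRankEmbed may additionally require `trust_remote_code` when loading):

```python
from model2vec.distill import distill

# Hypothetical sample of the ~42k code tokens mined from CornStack.
mined_tokens = ["def", "self", "return", "lambda", "->", "__init__", "async"]

# Distill 256-dimensional static embeddings from the teacher model,
# extending the teacher tokenizer with the mined vocabulary.
model = distill(
    model_name="nomic-ai/CodeRankEmbed",
    vocabulary=mined_tokens,
    pca_dims=256,
)
model.save_pretrained("potion-code-16M-distilled")
```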
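
Step 5's SIF weighting (smooth inverse frequency, Arora et al. 2017) down-weights frequent tokens. A minimal numpy sketch, assuming token frequencies counted over the training corpus; the smoothing constant `a = 1e-3` is the conventional default, not a documented choice for this model:

```python
import numpy as np

def sif_reweight(embeddings: np.ndarray, token_counts: np.ndarray, a: float = 1e-3) -> np.ndarray:
    """Scale each token vector by the SIF weight a / (a + p(token))."""
    p = token_counts / token_counts.sum()  # empirical token probabilities
    weights = a / (a + p)                  # frequent tokens get small weights
    return embeddings * weights[:, None]

# Toy example: four tokens with 256-dimensional vectors.
emb = np.random.randn(4, 256)
counts = np.array([100_000, 1_000, 50, 5])
emb_sif = sif_reweight(emb, counts)
```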

## Results

Results on the [CoIR benchmark](https://github.com/CoIR-team/coir) (NDCG@10, `mteb>=2.10`):

| Model | Params | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL | **AVG** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | - | - | - | - | - | - | - | - | - | - | - |
| BM25 | — | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 | 39.11 |
| **potion-code-16M** | **16M** | - | - | - | - | - | - | - | - | - | - | - |

*Results for CodeRankEmbed and potion-code-16M coming soon.*
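
A sketch for running such an evaluation. This follows the mteb 1.x-style interface; the card pins `mteb>=2.10`, where the entry points may differ, so treat this as an assumption-laden starting point rather than the exact harness used:

```python
import mteb
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

# Load the static model as a SentenceTransformer module for mteb.
static = StaticEmbedding.from_model2vec("Pringled/potion-code-16M")
model = SentenceTransformer(modules=[static])

# Two CoIR tasks as an example; the full benchmark covers all ten.
tasks = mteb.get_tasks(tasks=["CosQA", "AppsRetrieval"])
results = mteb.MTEB(tasks=tasks).run(model)
```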

## Model Details

| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static model; effectively unlimited) |

## Additional Resources

- [Model2Vec repository](https://github.com/MinishLab/model2vec)
- [Tokenlearn repository](https://github.com/MinishLab/tokenlearn)
- [CornStack dataset](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1)
- [CoIR benchmark](https://github.com/CoIR-team/coir)

## Citation

```bibtex
@software{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year = {2024},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.17270888},
  url = {https://github.com/MinishLab/model2vec},
  license = {MIT}
}
```