---
language:
- code
license: mit
library_name: model2vec
tags:
- model2vec
- embeddings
- code
- retrieval
- static-embeddings
---

# potion-code-16M Model Card

## Overview

**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.

It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.

## Installation

```bash
pip install model2vec
```

## Usage

```python
from model2vec import StaticModel

# Load the model from the Hugging Face Hub
model = StaticModel.from_pretrained("Pringled/potion-code-16M")

# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])

# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n    with open(path) as f:\n        return f.read()"])
```
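
Queries and code share one vector space, so retrieval is a cosine-similarity search. A minimal sketch continuing from the snippet above (the candidate snippets are made up for illustration):

```python
import numpy as np

# A few candidate code snippets to search over
snippets = [
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
    "def write_file(path, data):\n    with open(path, 'w') as f:\n        f.write(data)",
]
doc_embs = model.encode(snippets)
query_emb = model.encode(["How to read a file in Python?"])

# Cosine similarity = dot product of L2-normalized vectors
q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
d = doc_embs / np.linalg.norm(doc_embs, axis=-1, keepdims=True)
scores = (d @ q.T)[:, 0]

best = int(np.argmax(scores))
print(snippets[best])  # expected: the read_file snippet
```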

## How it works

potion-code-16M is created using the following pipeline:

1. **Vocabulary mining**: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
2. **Distillation**: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening); a sketch of this step follows the list
3. **Tokenlearn**: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using a cosine similarity loss
4. **Contrastive fine-tuning**: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs
5. **Post-SIF re-regularization**: token weights are re-regularized using SIF weighting after each training stage (see the weighting sketch below)
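
The distillation step (2) maps onto the `model2vec` distillation API. A minimal sketch, assuming a mined token list from step 1; exact argument names can differ between model2vec versions, and the token list and output path here are placeholders:

```python
from model2vec.distill import distill

# Illustrative stand-in for the code tokens mined from CornStack in step 1
mined_tokens = ["def", "self", "->", "import", "lambda", "println"]

# Distill 256-dimensional static embeddings from the teacher;
# the custom vocabulary extends the teacher's tokenizer
m2v_model = distill(
    model_name="nomic-ai/CodeRankEmbed",
    vocabulary=mined_tokens,
    pca_dims=256,
    trust_remote_code=True,  # assumption: CodeRankEmbed ships custom modeling code
)
m2v_model.save_pretrained("potion-code-16M-distilled")
```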
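
The SIF re-regularization in step 5 down-weights frequent tokens with the smooth inverse frequency weight w(t) = a / (a + p(t)). A pure-NumPy illustration of the idea, not the actual Tokenlearn implementation; the vocabulary size and counts below are made up:

```python
import numpy as np

def sif_reweight(embeddings: np.ndarray, token_counts: np.ndarray, a: float = 1e-4) -> np.ndarray:
    """Scale each token embedding by the SIF weight a / (a + p(token))."""
    p = token_counts / token_counts.sum()  # empirical token probabilities
    weights = a / (a + p)                  # frequent tokens get small weights
    return embeddings * weights[:, None]

# Example: a ~62.5k-token vocabulary with 256-dimensional embeddings
emb = np.random.randn(62_500, 256).astype(np.float32)
counts = np.random.randint(1, 1_000_000, size=62_500)
emb_reweighted = sif_reweight(emb, counts)
```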

## Results

Results on the [CoIR benchmark](https://github.com/CoIR-team/coir) (NDCG@10, `mteb>=2.10`):

| Model | Params | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL | **AVG** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | - | - | - | - | - | - | - | - | - | - | - |
| BM25 | - | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 | 39.11 |
| **potion-code-16M** | **16M** | - | - | - | - | - | - | - | - | - | - | - |

*Results for CodeRankEmbed and potion-code-16M coming soon.*

## Model Details

| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static embeddings; effectively no limit in practice) |

## Additional Resources

- [Model2Vec repository](https://github.com/MinishLab/model2vec)
- [Tokenlearn repository](https://github.com/MinishLab/tokenlearn)
- [CornStack dataset](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1)
- [CoIR benchmark](https://github.com/CoIR-team/coir)

## Citation

```bibtex
@software{minishlab2024model2vec,
  author    = {Tulkens, Stephan and {van Dongen}, Thomas},
  title     = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year      = {2024},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17270888},
  url       = {https://github.com/MinishLab/model2vec},
  license   = {MIT}
}
```