---
language:
- code
license: mit
library_name: model2vec
tags:
- model2vec
- embeddings
- code
- retrieval
- static-embeddings
datasets:
- minishlab/tokenlearn-cornstack-queries-coderankembed
- minishlab/tokenlearn-cornstack-docs-coderankembed
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-ruby-v1
---

# potion-code-16M Model Card

## Overview

**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It powers [Semble](https://github.com/MinishLab/semble), a code search library for agents. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.

It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.
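
As a rough illustration of that speed, the sketch below times batch encoding on CPU; the corpus and batch size are arbitrary, and absolute throughput depends on hardware.

```python
import time

from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-code-16M")

# Synthetic corpus of code snippets (illustrative only)
texts = ["def read_file(path):\n    with open(path) as f:\n        return f.read()"] * 10_000

start = time.perf_counter()
embeddings = model.encode(texts)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:,.0f} texts/sec, embeddings shape {embeddings.shape}")
```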

## Installation

```bash
pip install model2vec
```

## Usage

```python
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-code-16M")

# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])

# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n    with open(path) as f:\n        return f.read()"])
```
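
For retrieval, documents can then be ranked by cosine similarity between query and code embeddings. A minimal sketch (the ranking code below is illustrative, not part of the model2vec API):

```python
import numpy as np
from model2vec import StaticModel

model = StaticModel.from_pretrained("minishlab/potion-code-16M")

query = ["How to read a file in Python?"]
docs = [
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
    "def add(a, b):\n    return a + b",
]

q = model.encode(query)
d = model.encode(docs)

# Normalize rows so that dot products become cosine similarities
q = q / np.linalg.norm(q, axis=1, keepdims=True)
d = d / np.linalg.norm(d, axis=1, keepdims=True)
scores = (q @ d.T)[0]

print(docs[int(scores.argmax())])  # expected: the file-reading snippet
```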

## How it works

potion-code-16M is created using the following pipeline:

1. **Vocabulary mining**: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
2. **Distillation**: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening); a rough sketch of this step follows the list
3. **Tokenlearn**: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using a cosine similarity loss
4. **Contrastive fine-tuning**: the model is further fine-tuned with MultipleNegativesRankingLoss on 120k CornStack query-document pairs
5. **Post-SIF re-regularization**: token weights are re-regularized with SIF weighting after each training stage
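
For illustration, step 2 roughly corresponds to the Model2Vec call below. The vocabulary file name is hypothetical and the exact keyword arguments may differ from the real pipeline; see [`train.py`](./train.py) for the actual code.

```python
from model2vec.distill import distill

# Hypothetical file of code tokens mined from CornStack in step 1
with open("mined_code_tokens.txt") as f:
    extra_vocab = [line.strip() for line in f if line.strip()]

# Distill 256-dimensional static embeddings from the teacher,
# extending its tokenizer with the mined code vocabulary
model = distill(
    model_name="nomic-ai/CodeRankEmbed",
    vocabulary=extra_vocab,
    pca_dims=256,
    trust_remote_code=True,  # CodeRankEmbed ships custom modeling code
)
model.save_pretrained("potion-code-16M-distilled")
```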

## Results

Results on the [CoIR benchmark](https://github.com/CoIR-team/coir), evaluated via [MTEB](https://github.com/embeddings-benchmark/mteb) (NDCG@10, `mteb>=2.10`):

| Model | Params | AVG | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | 59.14 | 23.46 | 94.70 | 42.61 | 78.11 | 76.39 | 66.43 | 34.84 | 35.92 | 80.53 | 58.37 |
| **potion-code-16M + Hybrid** | **16M** | **40.41** | **5.23** | **34.03** | **51.23** | **64.26** | **33.22** | **52.67** | **31.14** | **21.63** | **69.65** | **41.03** |
| BM25 | — | 39.11 | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 |
| **potion-code-16M** | **16M** | **37.05** | **3.97** | **42.99** | **36.26** | **50.27** | **43.40** | **39.76** | **31.72** | **21.37** | **57.47** | **43.34** |
| potion-retrieval-32M | 32M | 32.10 | 4.22 | 31.80 | 36.71 | 45.11 | 38.64 | 29.97 | 32.62 | 8.70 | 56.26 | 36.93 |
| potion-base-32M | 32M | 31.42 | 3.37 | 29.58 | 34.77 | 42.69 | 37.88 | 28.51 | 30.55 | 14.61 | 53.36 | 38.88 |

CoIR covers a broad range of code retrieval scenarios. For the use case of finding code given a natural-language query, **CosQA** and **CodeFeedback (ST/MT)** are the most relevant tasks. Others are less so: **COIRCodeSearchNet** retrieves text given a code query (the reverse direction), and the **CodeTransOcean** tasks (CodeTransContest and CodeTransDL in the table) target cross-language code translation. The hybrid row combines dense retrieval with BM25 using min-max score normalization and equal weighting (alpha = 0.5).
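
As a minimal sketch of that hybrid scheme (illustrative only, not Semble's actual API), the combination step looks like this:

```python
import numpy as np

def minmax(scores: np.ndarray) -> np.ndarray:
    # Rescale scores to [0, 1]; guard against a constant score vector
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def hybrid_scores(dense: np.ndarray, bm25: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # Equal-weight combination (alpha = 0.5) of normalized dense and BM25 scores
    return alpha * minmax(dense) + (1 - alpha) * minmax(bm25)

# Example: per-document scores for one query from each retriever
dense = np.array([0.82, 0.31, 0.55])
bm25 = np.array([4.2, 7.9, 1.3])
print(hybrid_scores(dense, bm25))  # combined scores used for ranking
```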

## Model Details

| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static model, no limit in practice) |

## Additional Resources

- [Semble repository](https://github.com/MinishLab/semble)
- [Model2Vec repository](https://github.com/MinishLab/model2vec)
- [Tokenlearn repository](https://github.com/MinishLab/tokenlearn)
- [CornStack dataset](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1)
- [CoIR benchmark](https://github.com/CoIR-team/coir)

## Reproducibility

The full training pipeline (distill → tokenlearn → contrastive) is in [`train.py`](./train.py). It requires the `minishlab/tokenlearn-cornstack-docs-coderankembed` and `minishlab/tokenlearn-cornstack-queries-coderankembed` datasets (20k samples per language are used).

```bash
pip install model2vec tokenlearn sentence-transformers datasets skeletoken einops
python train.py
```

## Citation

```bibtex
@software{minishlab2024model2vec,
  author = {Tulkens, Stephan and {van Dongen}, Thomas},
  title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year = {2024},
  publisher = {Zenodo},
  doi = {10.5281/zenodo.17270888},
  url = {https://github.com/MinishLab/model2vec},
  license = {MIT}
}
```