---
language:
- code
license: mit
library_name: model2vec
tags:
- model2vec
- embeddings
- code
- retrieval
- static-embeddings
datasets:
- minishlab/tokenlearn-cornstack-queries-coderankembed
- minishlab/tokenlearn-cornstack-docs-coderankembed
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-ruby-v1
---
# potion-code-16M Model Card
## Overview
**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It powers [Semble](https://github.com/MinishLab/semble), a code search library for agents. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.
It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.
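This speed comes from encoding being a simple table lookup plus a mean, with no transformer forward pass. A toy pure-Python sketch of the idea (the vocabulary and 4-dimensional vectors below are made up for illustration and are not the real model weights):

```python
# Toy illustration of static embeddings: encoding is a lookup into a
# pre-computed token-vector table followed by mean pooling.
TOY_EMBEDDINGS = {
    "def":  [0.1, 0.0, 0.2, 0.0],
    "read": [0.0, 0.3, 0.0, 0.1],
    "file": [0.2, 0.1, 0.0, 0.0],
}

def encode_static(tokens, dim=4):
    """Mean-pool the pre-computed vectors of the known tokens."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

emb = encode_static(["def", "read", "file"])
```

Because no attention is computed, encoding cost grows only linearly with the number of tokens.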
## Installation
```bash
pip install model2vec
```
## Usage
```python
from model2vec import StaticModel
model = StaticModel.from_pretrained("minishlab/potion-code-16M")
# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])
# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n with open(path) as f:\n return f.read()"])
```
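The embeddings returned by `model.encode` can be compared with cosine similarity to rank code snippets against a query. A minimal helper, with small stand-in vectors in place of real `model.encode` outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in vectors; in practice these come from model.encode(...).
query = [0.2, 0.1, 0.7]
docs = {
    "read_file": [0.25, 0.05, 0.65],
    "sort_list": [0.9, 0.1, 0.0],
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
# ranked[0] == "read_file": its vector points in nearly the same direction
```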
## How it works
potion-code-16M is created using the following pipeline:
1. **Vocabulary mining**: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
2. **Distillation**: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening)
3. **Tokenlearn**: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using cosine similarity loss
4. **Contrastive fine-tuning**: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs
5. **Post-SIF re-regularization**: token weights are re-regularized using SIF weighting after each training stage
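The exact re-regularization settings used in training are not detailed here, but the standard SIF (smooth inverse frequency) formula down-weights frequent tokens as `w(t) = a / (a + p(t))`, where `p(t)` is the token's corpus probability. A sketch, assuming the commonly used smoothing constant `a = 1e-3` (the model's actual settings may differ):

```python
# SIF weighting sketch: frequent tokens get small weights, rare tokens
# get weights closer to 1. a = 1e-3 is the value commonly used in the
# SIF literature, not necessarily this model's setting.
def sif_weights(token_counts, a=1e-3):
    total = sum(token_counts.values())
    return {t: a / (a + c / total) for t, c in token_counts.items()}

counts = {"def": 900, "return": 90, "quicksort": 10}  # toy frequencies
weights = sif_weights(counts)
# "quicksort" (rare) is weighted far more heavily than "def" (frequent)
```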
## Results
Results on the [CoIR benchmark](https://github.com/CoIR-team/coir) on [MTEB](https://github.com/embeddings-benchmark/mteb) (NDCG@10, `mteb>=2.10`):
| Model | Params | AVG | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | 59.14 | 23.46 | 94.70 | 42.61 | 78.11 | 76.39 | 66.43 | 34.84 | 35.92 | 80.53 | 58.37 |
| **potion-code-16M + Hybrid** | **16M** | **40.41** | **5.23** | **34.03** | **51.23** | **64.26** | **33.22** | **52.67** | **31.14** | **21.63** | **69.65** | **41.03** |
| BM25 | — | 39.11 | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 |
| **potion-code-16M** | **16M** | **37.05** | **3.97** | **42.99** | **36.26** | **50.27** | **43.40** | **39.76** | **31.72** | **21.37** | **57.47** | **43.34** |
| potion-retrieval-32M | 32M | 32.10 | 4.22 | 31.80 | 36.71 | 45.11 | 38.64 | 29.97 | 32.62 | 8.70 | 56.26 | 36.93 |
| potion-base-32M | 32M | 31.42 | 3.37 | 29.58 | 34.77 | 42.69 | 37.88 | 28.51 | 30.55 | 14.61 | 53.36 | 38.88 |
CoIR covers a broad range of code retrieval scenarios. For the use case of finding code given a natural language query, **CosQA** and **CodeFeedback (ST/MT)** are the most relevant tasks. Others are less so: **COIRCodeSearchNet** retrieves text given a code query (the reverse direction), and the **CodeTransOcean** tasks (CodeTransContest, CodeTransDL) target cross-language code translation. The hybrid row combines dense retrieval with BM25 using min-max score normalization and equal weighting (alpha = 0.5).
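The hybrid combination described above can be sketched as follows: min-max normalize each retriever's scores to [0, 1], then mix them with equal weighting. The document scores below are toy values for illustration:

```python
# Hybrid dense + BM25 scoring: min-max normalization per retriever,
# then a convex combination with alpha = 0.5 (equal weighting).
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(dense, bm25, alpha=0.5):
    d, b = minmax(dense), minmax(bm25)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
            for doc in d.keys() | b.keys()}

dense_scores = {"doc1": 0.82, "doc2": 0.41, "doc3": 0.77}  # toy values
bm25_scores = {"doc1": 12.3, "doc2": 25.1, "doc3": 3.4}    # toy values
combined = hybrid(dense_scores, bm25_scores)
best = max(combined, key=combined.get)
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities live in [-1, 1]; without it, one retriever would dominate the mix.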
## Model Details
| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static embeddings impose no practical length limit) |
## Additional Resources
- [Semble repository](https://github.com/MinishLab/semble)
- [Model2Vec repository](https://github.com/MinishLab/model2vec)
- [Tokenlearn repository](https://github.com/MinishLab/tokenlearn)
- [CornStack dataset](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1)
- [CoIR benchmark](https://github.com/CoIR-team/coir)
## Reproducibility
The full training pipeline (distill → tokenlearn → contrastive) is in [`train.py`](./train.py). It requires the `minishlab/tokenlearn-cornstack-docs-coderankembed` and `minishlab/tokenlearn-cornstack-queries-coderankembed` datasets (20k samples per language).
```bash
pip install model2vec tokenlearn sentence-transformers datasets skeletoken einops
python train.py
```
## Citation
```bibtex
@software{minishlab2024model2vec,
author = {Stephan Tulkens and {van Dongen}, Thomas},
title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
year = {2024},
publisher = {Zenodo},
doi = {10.5281/zenodo.17270888},
url = {https://github.com/MinishLab/model2vec},
license = {MIT}
}
```