---
language:
- code
license: mit
library_name: model2vec
tags:
- model2vec
- embeddings
- code
- retrieval
- static-embeddings
datasets:
- minishlab/tokenlearn-cornstack-queries-coderankembed
- minishlab/tokenlearn-cornstack-docs-coderankembed
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-ruby-v1
---
# potion-code-16M Model Card
## Overview
**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It powers [Semble](https://github.com/MinishLab/semble), a code search library for agents. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.
It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.
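This speed comes from encoding being a simple table lookup plus a mean, with no transformer forward pass. A toy pure-Python sketch of the idea (the vocabulary and 4-dimensional vectors below are made up for illustration and are not the real model weights):

```python
# Toy illustration of static embeddings: encoding is a lookup into a
# pre-computed token-vector table followed by mean pooling.
TOY_EMBEDDINGS = {
    "def":  [0.1, 0.0, 0.2, 0.0],
    "read": [0.0, 0.3, 0.0, 0.1],
    "file": [0.2, 0.1, 0.0, 0.0],
}

def encode_static(tokens, dim=4):
    """Mean-pool the pre-computed vectors of the known tokens."""
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

emb = encode_static(["def", "read", "file"])
```

Because no attention is computed, encoding cost grows only linearly with the number of tokens.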
## Installation
```bash
pip install model2vec
```
## Usage
```python
from model2vec import StaticModel
model = StaticModel.from_pretrained("minishlab/potion-code-16M")
# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])
# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n with open(path) as f:\n return f.read()"])
```
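The embeddings returned by `model.encode` can be compared with cosine similarity to rank code snippets against a query. A minimal helper, with small stand-in vectors in place of real `model.encode` outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Stand-in vectors; in practice these come from model.encode(...).
query = [0.2, 0.1, 0.7]
docs = {
    "read_file": [0.25, 0.05, 0.65],
    "sort_list": [0.9, 0.1, 0.0],
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
# ranked[0] == "read_file": its vector points in nearly the same direction
```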
## How it works
potion-code-16M is created using the following pipeline:
1. **Vocabulary mining**: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
2. **Distillation**: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening)
3. **Tokenlearn**: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using cosine similarity loss
4. **Contrastive fine-tuning**: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs
5. **Post-SIF re-regularization**: token weights are re-regularized using SIF weighting after each training stage
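The exact re-regularization settings used in training are not detailed here, but the standard SIF (smooth inverse frequency) formula down-weights frequent tokens as `w(t) = a / (a + p(t))`, where `p(t)` is the token's corpus probability. A sketch, assuming the commonly used smoothing constant `a = 1e-3` (the model's actual settings may differ):

```python
# SIF weighting sketch: frequent tokens get small weights, rare tokens
# get weights closer to 1. a = 1e-3 is the value commonly used in the
# SIF literature, not necessarily this model's setting.
def sif_weights(token_counts, a=1e-3):
    total = sum(token_counts.values())
    return {t: a / (a + c / total) for t, c in token_counts.items()}

counts = {"def": 900, "return": 90, "quicksort": 10}  # toy frequencies
weights = sif_weights(counts)
# "quicksort" (rare) is weighted far more heavily than "def" (frequent)
```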
## Results
Results on the [CoIR benchmark](https://github.com/CoIR-team/coir) on [MTEB](https://github.com/embeddings-benchmark/mteb) (NDCG@10, `mteb>=2.10`):
| Model | Params | AVG | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | 59.14 | 23.46 | 94.70 | 42.61 | 78.11 | 76.39 | 66.43 | 34.84 | 35.92 | 80.53 | 58.37 |
| **potion-code-16M + Hybrid** | **16M** | **40.41** | **5.23** | **34.03** | **51.23** | **64.26** | **33.22** | **52.67** | **31.14** | **21.63** | **69.65** | **41.03** |
| BM25 | — | 39.11 | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 |
| **potion-code-16M** | **16M** | **37.05** | **3.97** | **42.99** | **36.26** | **50.27** | **43.40** | **39.76** | **31.72** | **21.37** | **57.47** | **43.34** |
| potion-retrieval-32M | 32M | 32.10 | 4.22 | 31.80 | 36.71 | 45.11 | 38.64 | 29.97 | 32.62 | 8.70 | 56.26 | 36.93 |
| potion-base-32M | 32M | 31.42 | 3.37 | 29.58 | 34.77 | 42.69 | 37.88 | 28.51 | 30.55 | 14.61 | 53.36 | 38.88 |
CoIR covers a broad range of code retrieval scenarios. For the use case of finding code given a natural language query, **CosQA** and **CodeFeedback (ST/MT)** are the most relevant tasks. Others are less so: **COIRCodeSearchNet** retrieves text given a code query (the reverse direction), and the **CodeTransOcean** tasks (CodeTransContest, CodeTransDL) target cross-language code translation. The hybrid row combines dense retrieval with BM25 using min-max score normalization and equal weighting (alpha = 0.5).
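The hybrid combination described above can be sketched as follows: min-max normalize each retriever's scores to [0, 1], then mix them with equal weighting. The document scores below are toy values for illustration:

```python
# Hybrid dense + BM25 scoring: min-max normalization per retriever,
# then a convex combination with alpha = 0.5 (equal weighting).
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(dense, bm25, alpha=0.5):
    d, b = minmax(dense), minmax(bm25)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * b.get(doc, 0.0)
            for doc in d.keys() | b.keys()}

dense_scores = {"doc1": 0.82, "doc2": 0.41, "doc3": 0.77}  # toy values
bm25_scores = {"doc1": 12.3, "doc2": 25.1, "doc3": 3.4}    # toy values
combined = hybrid(dense_scores, bm25_scores)
best = max(combined, key=combined.get)
```

Normalizing first matters because raw BM25 scores are unbounded while cosine similarities live in [-1, 1]; without it, one retriever would dominate the mix.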
## Model Details
| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static embeddings impose no practical length limit) |
## Additional Resources
- [Semble repository](https://github.com/MinishLab/semble)
- [Model2Vec repository](https://github.com/MinishLab/model2vec)
- [Tokenlearn repository](https://github.com/MinishLab/tokenlearn)
- [CornStack dataset](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1)
- [CoIR benchmark](https://github.com/CoIR-team/coir)
## Reproducibility
The full training pipeline (distill → tokenlearn → contrastive) is in [`train.py`](./train.py). It requires the `minishlab/tokenlearn-cornstack-docs-coderankembed` and `minishlab/tokenlearn-cornstack-queries-coderankembed` datasets (20k samples per language).
```bash
pip install model2vec tokenlearn sentence-transformers datasets skeletoken einops
python train.py
```
## Citation
```bibtex
@software{minishlab2024model2vec,
author = {Stephan Tulkens and {van Dongen}, Thomas},
title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
year = {2024},
publisher = {Zenodo},
doi = {10.5281/zenodo.17270888},
url = {https://github.com/MinishLab/model2vec},
license = {MIT}
}
```