---
language:
- code
license: mit
library_name: model2vec
tags:
- model2vec
- embeddings
- code
- retrieval
- static-embeddings
datasets:
- minishlab/tokenlearn-cornstack-queries-coderankembed
- minishlab/tokenlearn-cornstack-docs-coderankembed
- nomic-ai/cornstack-python-v1
- nomic-ai/cornstack-java-v1
- nomic-ai/cornstack-php-v1
- nomic-ai/cornstack-go-v1
- nomic-ai/cornstack-javascript-v1
- nomic-ai/cornstack-ruby-v1
---
# potion-code-16M Model Card
## Overview
**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It powers [Semble](https://github.com/MinishLab/semble), a code search library for agents. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.
It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.
## Installation
```bash
pip install model2vec
```
## Usage
```python
from model2vec import StaticModel
model = StaticModel.from_pretrained("minishlab/potion-code-16M")
# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])
# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n with open(path) as f:\n return f.read()"])
```
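The embeddings above can be used directly for retrieval by ranking code snippets against a natural-language query with cosine similarity. A minimal sketch in plain NumPy (`rank_by_cosine` is an illustrative helper, not part of the model2vec API):

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted by descending cosine similarity to the query."""
    q = np.asarray(query_vec, dtype=np.float64)
    d = np.asarray(doc_vecs, dtype=np.float64)
    # Cosine similarity; the small epsilon guards against zero-norm vectors
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims), sims
```

With the model loaded as above, `order, sims = rank_by_cosine(query_embeddings[0], code_embeddings)` ranks the code documents for the query.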
## How it works
potion-code-16M is created using the following pipeline:
1. **Vocabulary mining**: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
2. **Distillation**: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening)
3. **Tokenlearn**: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using cosine similarity loss
4. **Contrastive fine-tuning**: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs
5. **Post-SIF re-regularization**: token weights are re-regularized using SIF weighting after each training stage
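The SIF re-regularization in step 5 down-weights frequent tokens so that rare, informative tokens dominate the sentence embedding. A minimal sketch of SIF weighting (the smoothing constant `a` below is a commonly used default; Model2Vec's internal value and exact procedure may differ):

```python
import numpy as np

def sif_weights(token_counts, a=1e-3):
    """Smooth inverse frequency weights: w(t) = a / (a + p(t))."""
    counts = np.asarray(token_counts, dtype=np.float64)
    probs = counts / counts.sum()  # empirical token probabilities
    return a / (a + probs)         # frequent tokens get weights near 0, rare ones near 1
```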
## Results
Results on the [CoIR benchmark](https://github.com/CoIR-team/coir) on [MTEB](https://github.com/embeddings-benchmark/mteb) (NDCG@10, `mteb>=2.10`):
| Model | Params | AVG | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | 59.14 | 23.46 | 94.70 | 42.61 | 78.11 | 76.39 | 66.43 | 34.84 | 35.92 | 80.53 | 58.37 |
| **potion-code-16M + Hybrid** | **16M** | **40.41** | **5.23** | **34.03** | **51.23** | **64.26** | **33.22** | **52.67** | **31.14** | **21.63** | **69.65** | **41.03** |
| BM25 | — | 39.11 | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 |
| **potion-code-16M** | **16M** | **37.05** | **3.97** | **42.99** | **36.26** | **50.27** | **43.40** | **39.76** | **31.72** | **21.37** | **57.47** | **43.34** |
| potion-retrieval-32M | 32M | 32.10 | 4.22 | 31.80 | 36.71 | 45.11 | 38.64 | 29.97 | 32.62 | 8.70 | 56.26 | 36.93 |
| potion-base-32M | 32M | 31.42 | 3.37 | 29.58 | 34.77 | 42.69 | 37.88 | 28.51 | 30.55 | 14.61 | 53.36 | 38.88 |
CoIR covers a broad range of code retrieval scenarios. For the use case of finding code given a natural language query, **CosQA** and **CodeFeedback (ST/MT)** are the most relevant tasks. Others are less so: **COIRCodeSearchNet** retrieves text given a code query (the reverse direction), and the **CodeTrans** tasks target cross-language code translation. The hybrid row combines dense retrieval with BM25 using min-max score normalization and equal weighting (alpha=0.5).
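The hybrid combination described above can be sketched as follows (`hybrid_scores` is an illustrative helper, not a Semble or model2vec function; it assumes both retrievers score the same candidate list):

```python
import numpy as np

def minmax(scores):
    """Rescale scores to [0, 1]; constant score lists map to all zeros."""
    s = np.asarray(scores, dtype=np.float64)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

def hybrid_scores(dense, bm25, alpha=0.5):
    """Weighted sum of min-max normalized dense and BM25 scores (alpha=0.5 = equal weighting)."""
    return alpha * minmax(dense) + (1 - alpha) * minmax(bm25)
```

Normalizing both score distributions first is what makes the equal weighting meaningful, since raw cosine similarities and BM25 scores live on very different scales.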
## Model Details
| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static, no limit in practice) |
## Additional Resources
- [Semble repository](https://github.com/MinishLab/semble)
- [Model2Vec repository](https://github.com/MinishLab/model2vec)
- [Tokenlearn repository](https://github.com/MinishLab/tokenlearn)
- [CornStack dataset](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1)
- [CoIR benchmark](https://github.com/CoIR-team/coir)
## Reproducibility
The full training pipeline (distill → tokenlearn → contrastive) is in [`train.py`](./train.py). It requires the `minishlab/tokenlearn-cornstack-docs-coderankembed` and `minishlab/tokenlearn-cornstack-queries-coderankembed` datasets (20k samples per language).
```bash
pip install model2vec tokenlearn sentence-transformers datasets skeletoken einops
python train.py
```
## Citation
```bibtex
@software{minishlab2024model2vec,
author = {Tulkens, Stephan and {van Dongen}, Thomas},
title = {Model2Vec: Fast State-of-the-Art Static Embeddings},
year = {2024},
publisher = {Zenodo},
doi = {10.5281/zenodo.17270888},
url = {https://github.com/MinishLab/model2vec},
license = {MIT}
}
```