---
language:
- code
license: mit
library_name: model2vec
tags:
- model2vec
- embeddings
- code
- retrieval
- static-embeddings
---

# potion-code-16M Model Card

## Overview

**potion-code-16M** is a fast static code embedding model optimized for code retrieval tasks. It is distilled from [nomic-ai/CodeRankEmbed](https://huggingface.co/nomic-ai/CodeRankEmbed) and trained on the [CornStack](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1) code corpus using [Tokenlearn](https://github.com/MinishLab/tokenlearn) and contrastive fine-tuning.

It uses static embeddings, allowing text and code embeddings to be computed orders of magnitude faster than transformer-based models on both GPU and CPU.

## Installation

```bash
pip install model2vec
```

## Usage

```python
from model2vec import StaticModel

# Load the model from the Hugging Face Hub
model = StaticModel.from_pretrained("Pringled/potion-code-16M")

# Embed natural language queries
query_embeddings = model.encode(["How to read a file in Python?"])

# Embed code documents
code_embeddings = model.encode(["def read_file(path):\n    with open(path) as f:\n        return f.read()"])
```
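
Queries and code share one vector space, so retrieval is a cosine-similarity search. A minimal sketch continuing from the snippet above (the candidate snippets are made up for illustration):

```python
import numpy as np

# A few candidate code snippets to search over
snippets = [
    "def read_file(path):\n    with open(path) as f:\n        return f.read()",
    "def write_file(path, data):\n    with open(path, 'w') as f:\n        f.write(data)",
]
doc_embs = model.encode(snippets)
query_emb = model.encode(["How to read a file in Python?"])

# Cosine similarity = dot product of L2-normalized vectors
q = query_emb / np.linalg.norm(query_emb, axis=-1, keepdims=True)
d = doc_embs / np.linalg.norm(doc_embs, axis=-1, keepdims=True)
scores = (d @ q.T)[:, 0]

best = int(np.argmax(scores))
print(snippets[best])  # expected: the read_file snippet
```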

## How it works

potion-code-16M is created using the following pipeline:

1. **Vocabulary mining**: code-specific tokens are mined from CornStack and added to the base CodeRankEmbed tokenizer (42k extra tokens → ~62.5k total)
2. **Distillation**: the extended vocabulary is distilled from CodeRankEmbed using Model2Vec (256-dimensional embeddings, PCA whitening); a sketch of this step follows the list
3. **Tokenlearn**: the distilled model is fine-tuned on 240k (query, document) pairs from CornStack using a cosine similarity loss
4. **Contrastive fine-tuning**: the model is further fine-tuned using MultipleNegativesRankingLoss on 120k CornStack query-document pairs
5. **Post-SIF re-regularization**: token weights are re-regularized using SIF weighting after each training stage (see the weighting sketch below)
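
The distillation step (2) maps onto the `model2vec` distillation API. A minimal sketch, assuming a mined token list from step 1; exact argument names can differ between model2vec versions, and the token list and output path here are placeholders:

```python
from model2vec.distill import distill

# Illustrative stand-in for the code tokens mined from CornStack in step 1
mined_tokens = ["def", "self", "->", "import", "lambda", "println"]

# Distill 256-dimensional static embeddings from the teacher;
# the custom vocabulary extends the teacher's tokenizer
m2v_model = distill(
    model_name="nomic-ai/CodeRankEmbed",
    vocabulary=mined_tokens,
    pca_dims=256,
    trust_remote_code=True,  # assumption: CodeRankEmbed ships custom modeling code
)
m2v_model.save_pretrained("potion-code-16M-distilled")
```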
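
The SIF re-regularization in step 5 down-weights frequent tokens with the smooth inverse frequency weight w(t) = a / (a + p(t)). A pure-NumPy illustration of the idea, not the actual Tokenlearn implementation; the vocabulary size and counts below are made up:

```python
import numpy as np

def sif_reweight(embeddings: np.ndarray, token_counts: np.ndarray, a: float = 1e-4) -> np.ndarray:
    """Scale each token embedding by the SIF weight a / (a + p(token))."""
    p = token_counts / token_counts.sum()  # empirical token probabilities
    weights = a / (a + p)                  # frequent tokens get small weights
    return embeddings * weights[:, None]

# Example: a ~62.5k-token vocabulary with 256-dimensional embeddings
emb = np.random.randn(62_500, 256).astype(np.float32)
counts = np.random.randint(1, 1_000_000, size=62_500)
emb_reweighted = sif_reweight(emb, counts)
```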

## Results

Results on the [CoIR benchmark](https://github.com/CoIR-team/coir) (NDCG@10, `mteb>=2.10`):

| Model | Params | AppsRetrieval | COIRCodeSearchNet | CodeFeedbackMT | CodeFeedbackST | CodeSearchNetCC | CodeTransContest | CodeTransDL | CosQA | StackOverflow | Text2SQL | **AVG** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CodeRankEmbed | 137M | - | - | - | - | - | - | - | - | - | - | - |
| BM25 | - | 4.76 | 32.45 | 59.69 | 67.85 | 33.00 | 47.29 | 32.97 | 15.53 | 69.54 | 28.07 | 39.11 |
| **potion-code-16M** | **16M** | - | - | - | - | - | - | - | - | - | - | - |

*Results for CodeRankEmbed and potion-code-16M coming soon.*

## Model Details

| Property | Value |
|---|---|
| Parameters | ~16M |
| Embedding dimensions | 256 |
| Vocabulary size | ~62,500 |
| Teacher model | nomic-ai/CodeRankEmbed |
| Training corpus | CornStack (6 languages: Python, Java, JavaScript, Go, PHP, Ruby) |
| Max sequence length | 1,000,000 tokens (static embeddings; effectively no limit in practice) |

## Additional Resources

- [Model2Vec repository](https://github.com/MinishLab/model2vec)
- [Tokenlearn repository](https://github.com/MinishLab/tokenlearn)
- [CornStack dataset](https://huggingface.co/datasets/nomic-ai/cornstack-python-v1)
- [CoIR benchmark](https://github.com/CoIR-team/coir)

## Citation

```bibtex
@software{minishlab2024model2vec,
  author    = {Tulkens, Stephan and {van Dongen}, Thomas},
  title     = {Model2Vec: Fast State-of-the-Art Static Embeddings},
  year      = {2024},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17270888},
  url       = {https://github.com/MinishLab/model2vec},
  license   = {MIT}
}
```