---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-search
- knowledge-distillation
- modernbert
- apple-silicon
- mps
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
language:
- en
datasets:
- sentence-transformers/codesearchnet
base_model: lightonai/ColBERT-Zero
---
# ColBERT-Zero-6L-CodeSearch
A **6-layer ColBERT model** distilled from [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) (22 layers) for code search, achieving **85% of the teacher's retrieval quality at 13x faster query speed**.
## Model Details
| Parameter | Value |
|-----------|-------|
| **Architecture** | ModernBERT (6 layers, 768 hidden, 12 heads) |
| **Base Model** | [lightonai/ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) |
| **Output Dimensionality** | 128 per-token embeddings |
| **Similarity Function** | MaxSim (late interaction) |
| **Parameters** | ~38M (vs ~100M teacher) |
| **Query Length** | 32 tokens |
| **Document Length** | 180 tokens |
| **License** | Apache 2.0 |
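MaxSim (late interaction) scores a query against a document by matching each query token embedding to its single best document token and summing those maxima. A minimal NumPy sketch of the scoring rule (illustrative only, not PyLate's implementation):

```python
import numpy as np

def maxsim(query_emb, doc_emb):
    # query_emb: (num_query_tokens, dim), doc_emb: (num_doc_tokens, dim)
    # Both are assumed L2-normalized, as ColBERT token embeddings are.
    sim = query_emb @ doc_emb.T        # token-level cosine similarities (q, d)
    return sim.max(axis=1).sum()       # best document match per query token, summed

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(10, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim(q, d)
```

Because each of the 4 query tokens contributes a cosine maximum of at most 1, the score is bounded by the number of query tokens.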
## Benchmark Results
Evaluated on 3 code search corpora (150 questions total) via [litembeddings](https://github.com/alexandernicholson/litembeddings):
| Corpus | Teacher MRR | Student MRR | % of Teacher | Student Query Speed |
|--------|------------|-------------|--------------|---------------------|
| jq (C) | 0.539 | 0.355 | 65.9% | ~7ms |
| Rails (Ruby) | 0.679 | 0.581 | 85.6% | ~3ms |
| FastAPI (Python) | 0.782 | 0.766 | **98.0%** | ~4ms |
| **Aggregate** | **0.667** | **0.568** | **85.1%** | **~5ms** |
The student model is approximately **13x faster** at query time than the teacher while retaining 85% of retrieval quality. Performance is particularly strong on Python code search (98% of teacher).
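MRR, the metric in the table above, averages the reciprocal rank of the first relevant result across queries. A short sketch (the ranks below are hypothetical, not taken from the benchmark):

```python
def mrr(ranks):
    # ranks: 1-based rank of the first relevant document for each query
    return sum(1.0 / r for r in ranks) / len(ranks)

# hypothetical run: relevant doc retrieved at ranks 1, 2, and 4 across three queries
example = mrr([1, 2, 4])  # (1 + 1/2 + 1/4) / 3
```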
## How the Student Was Built
### Architecture: Layer Pruning from Teacher
The student was created by selecting 6 layers from ColBERT-Zero's 22-layer ModernBERT backbone using a **skewed-late** strategy that preserves more upper layers (which encode retrieval-relevant semantics):
```
Teacher layers: [0, 1, 2, ..., 21] (22 total)
Student layers: [0, 8, 14, 17, 19, 21] (6 selected)
```
The student inherits:
- All embedding weights from the teacher
- The 768-to-128 ColBERT projection layer
- Selected transformer layers with full weight copying
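The select-and-copy step can be sketched as follows. `prune_layers` and the toy `nn.Linear` backbone are illustrative stand-ins, not the actual build script; with a real Hugging Face checkpoint you would index into the model's transformer block list (the attribute name varies by architecture):

```python
import torch.nn as nn

KEEP = [0, 8, 14, 17, 19, 21]  # skewed-late selection from the card

def prune_layers(teacher_layers: nn.ModuleList) -> nn.ModuleList:
    # Reuse the selected teacher blocks directly, weights and all.
    return nn.ModuleList(teacher_layers[i] for i in KEEP)

# toy stand-in for a 22-layer backbone
teacher_layers = nn.ModuleList(nn.Linear(8, 8) for _ in range(22))
student_layers = prune_layers(teacher_layers)
```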
### Training: Knowledge Distillation
- **Dataset**: [CodeSearchNet](https://huggingface.co/datasets/sentence-transformers/codesearchnet) (10,000 comment-code pairs)
- **Teacher scoring**: ColBERT-Zero generates MaxSim relevance scores for each query against 1 positive + 3 random negative documents
- **Loss**: PyLate Distillation loss (KL divergence between teacher and student score distributions)
- **Optimizer**: AdamW, lr=5e-5, weight_decay=0.01, warmup_ratio=0.1
- **Training**: 1000 steps, batch_size=8, gradient_accumulation=4 (effective batch size 32)
- **Hardware**: Apple Silicon (M4 Max) via PyTorch MPS backend, ~17 minutes total
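The distillation objective described above (KL divergence between teacher and student score distributions over 1 positive + 3 negatives) can be sketched in PyTorch. This is an illustration of the objective, not PyLate's actual `Distillation` loss code:

```python
import torch
import torch.nn.functional as F

def distill_loss(teacher_scores, student_scores):
    # Softmax both score vectors over the candidate documents, then take
    # KL(teacher || student) so the student matches the teacher's distribution.
    t = F.log_softmax(teacher_scores, dim=-1)
    s = F.log_softmax(student_scores, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

# one query, MaxSim scores for [positive, neg, neg, neg] (made-up values)
teacher = torch.tensor([[10.0, 2.0, 1.0, 0.5]])
student = torch.tensor([[8.0, 3.0, 1.0, 0.5]])
loss = distill_loss(teacher, student)
```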
### Hyperparameter Search
The optimal configuration was found through **30 autonomous experiments** sweeping learning rate, layer selection strategy, batch size, gradient accumulation, weight decay, warmup ratio, number of negatives, training steps, and embedding dimensions. Key findings:
- **Teacher initialization is critical**: Starting from ColBERT-Zero's weights (MRR 0.46) vs raw ModernBERT (MRR 0.08) — a 5.6x improvement
- **Skewed-late layer selection** outperforms evenly-spaced, last-6, and other strategies
- **Effective batch size 32** (bs=8, grad_accum=4) is optimal
- **Weight decay 0.01** provides regularization benefit
## Usage
### Installation
```bash
pip install pylate
```
### Encoding & Retrieval
```python
from pylate import indexes, models, retrieve
# Load model
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")
# Encode documents
doc_embeddings = model.encode(
    ["def hello():\n    print('Hello, World!')", "class UserAuth:\n    ..."],
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)
# Encode queries
query_embeddings = model.encode(
    ["function that prints a greeting"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)
# Score with MaxSim
from pylate.scores import colbert_scores
scores = colbert_scores(query_embeddings, doc_embeddings)
print(scores) # Higher = more relevant
```
### Reranking
```python
from pylate import rank, models
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")
queries = ["how to authenticate users"]
documents = [["def login(user, pwd): ...", "def sort_list(arr): ...", "class AuthMiddleware: ..."]]
documents_ids = [["doc1", "doc2", "doc3"]]
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```
## GGUF / litembeddings
This model can be converted to GGUF format for use with [litembeddings](https://github.com/alexandernicholson/litembeddings) (SQLite-based embedding engine with SIMD-accelerated MaxSim):
```bash
# Convert to GGUF
python convert_hf_to_gguf.py ctrltokyo/ColBERT-Zero-6L-CodeSearch --outfile model-f16.gguf --outtype f16
# Extract projection
python -c "
from safetensors import safe_open
import numpy as np
f = safe_open('1_Dense/model.safetensors', framework='numpy')
f.get_tensor('linear.weight').astype(np.float32).tofile('model.projection')
"
```
Then in SQL:
```sql
SELECT lembed_model('codesearch', 'model-f16.gguf', '{"colbert_projection": "model.projection"}');
SELECT lembed_maxsim(
  lembed_tokens('search_query: how to sort a list'),
  lembed_tokens('search_document: def quicksort(arr): ...')
);
```
## Limitations
- **Weakest on C code search** (65.9% of teacher on jq corpus) — likely because CodeSearchNet training data is Python-heavy
- **Trained on 10k pairs only** — larger training sets or hard negative mining could improve quality further
- **English only** — inherits ColBERT-Zero's language capabilities
- **No asymmetric prompts** — unlike the teacher, this model does not use `search_query:`/`search_document:` prompts (uses `[Q]`/`[D]` prefixes instead)
## Citation
```bibtex
@misc{colbert-zero-6l-codesearch,
  title={ColBERT-Zero-6L-CodeSearch: A Distilled ColBERT Model for Code Search},
  author={Alexander Nicholson},
  year={2026},
  note={Distilled from ColBERT-Zero (Chaffin et al., 2026) using PyLate on Apple Silicon}
}
```
## Acknowledgments
- [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) by LightOn AI — the teacher model
- [PyLate](https://github.com/lightonai/pylate) — ColBERT training framework
- [litembeddings](https://github.com/alexandernicholson/litembeddings) — SQLite embedding engine used for benchmarking
- Training and experimentation performed entirely on Apple Silicon (M4 Max) using PyTorch MPS backend