---
tags:
- ColBERT
- PyLate
- sentence-transformers
- sentence-similarity
- feature-extraction
- code-search
- knowledge-distillation
- modernbert
- apple-silicon
- mps
pipeline_tag: sentence-similarity
library_name: PyLate
license: apache-2.0
language:
- en
datasets:
- sentence-transformers/codesearchnet
base_model: lightonai/ColBERT-Zero
---

# ColBERT-Zero-6L-CodeSearch

A **6-layer ColBERT model** distilled from [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) (22 layers) for code search, achieving **85% of the teacher's retrieval quality at 13x faster query speed**.

## Model Details

| Parameter | Value |
|-----------|-------|
| **Architecture** | ModernBERT (6 layers, 768 hidden, 12 heads) |
| **Base Model** | [lightonai/ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) |
| **Output Dimensionality** | 128 per-token embeddings |
| **Similarity Function** | MaxSim (late interaction) |
| **Parameters** | ~38M (vs ~100M teacher) |
| **Query Length** | 32 tokens |
| **Document Length** | 180 tokens |
| **License** | Apache 2.0 |
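The MaxSim (late interaction) similarity in the table scores a query against a document by matching each query token to its single best document token and summing those matches. A minimal numpy sketch of the idea (illustrative only, not PyLate's implementation):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim late interaction: for each query token, take the max
    similarity over all document tokens, then sum over query tokens."""
    # Token embeddings are assumed L2-normalised, so dot product = cosine.
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())  # best-matching doc token per query token

# Toy example matching this model's shapes: 32 query tokens, 180 doc tokens, 128 dims
rng = np.random.default_rng(0)
q = rng.normal(size=(32, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(180, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim(q, d))
```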

## Benchmark Results

Evaluated on 3 code search corpora (150 questions total) via [litembeddings](https://github.com/alexandernicholson/litembeddings):

| Corpus | Teacher MRR | Student MRR | % of Teacher | Student Query Speed |
|--------|------------|-------------|--------------|---------------------|
| jq (C) | 0.539 | 0.355 | 65.9% | ~7ms |
| Rails (Ruby) | 0.679 | 0.581 | 85.6% | ~3ms |
| FastAPI (Python) | 0.782 | 0.766 | **98.0%** | ~4ms |
| **Aggregate** | **0.667** | **0.568** | **85.1%** | **~5ms** |

The student model is approximately **13x faster** at query time than the teacher while retaining 85% of retrieval quality. Performance is particularly strong on Python code search (98% of teacher).
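MRR here is the mean reciprocal rank: the average over queries of 1/rank of the first relevant result. A small illustrative implementation of the metric (not the litembeddings evaluation harness):

```python
def mean_reciprocal_rank(ranked_results: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the first relevant document per query.
    Queries whose relevant document never appears contribute 0."""
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id == rel:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Two queries: relevant doc ranked 1st and 2nd -> (1.0 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], ["a", "y"]))  # 0.75
```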

## How the Student Was Built

### Architecture: Layer Pruning from Teacher

The student was created by selecting 6 layers from ColBERT-Zero's 22-layer ModernBERT backbone using a **skewed-late** strategy that preserves more upper layers (which encode retrieval-relevant semantics):

```
Teacher layers: [0, 1, 2, ..., 21]  (22 total)
Student layers: [0, 8, 14, 17, 19, 21]  (6 selected)
```

The student inherits:
- All embedding weights from the teacher
- The 768-to-128 ColBERT projection layer
- Selected transformer layers with full weight copying
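The pruning step above amounts to filtering and renumbering the teacher's state dict. An illustrative reconstruction (not the exact build script; it assumes the usual `*.layers.N.*` key naming, while embedding and projection weights pass through untouched):

```python
# Skewed-late selection: 6 of the teacher's 22 layers, biased toward the top.
KEEP = [0, 8, 14, 17, 19, 21]

def prune_state_dict(teacher_sd: dict) -> dict:
    """Copy kept transformer layers (renumbered 0..5) plus all
    non-layer weights (embeddings, projection) into a student state dict."""
    student_sd = {}
    for name, tensor in teacher_sd.items():
        if ".layers." in name:
            old = int(name.split(".layers.")[1].split(".")[0])
            if old not in KEEP:
                continue  # drop unselected layers
            new = KEEP.index(old)
            name = name.replace(f".layers.{old}.", f".layers.{new}.")
        student_sd[name] = tensor  # embeddings, projection, etc. copied as-is
    return student_sd
```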

### Training: Knowledge Distillation

- **Dataset**: [CodeSearchNet](https://huggingface.co/datasets/sentence-transformers/codesearchnet) (10,000 comment-code pairs)
- **Teacher scoring**: ColBERT-Zero generates MaxSim relevance scores for each query against 1 positive + 3 random negative documents
- **Loss**: PyLate Distillation loss (KL divergence between teacher and student score distributions)
- **Optimizer**: AdamW, lr=5e-5, weight_decay=0.01, warmup_ratio=0.1
- **Training**: 1000 steps, batch_size=8, gradient_accumulation=4 (effective batch size 32)
- **Hardware**: Apple Silicon (M4 Max) via PyTorch MPS backend, ~17 minutes total
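The distillation objective above minimises the KL divergence between the teacher's and student's softmax-normalised score distributions over each query's candidate set (1 positive + 3 negatives). A minimal numpy sketch of that objective (illustrative, not PyLate's `Distillation` loss implementation):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(student_scores: np.ndarray, teacher_scores: np.ndarray) -> float:
    """Mean KL(teacher || student) over each query's candidate-document scores."""
    p = softmax(teacher_scores)  # teacher target distribution
    q = softmax(student_scores)  # student distribution
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

# Toy batch: 2 queries x 4 candidates (1 positive + 3 negatives each)
teacher = np.array([[9.0, 2.0, 1.0, 0.5], [8.0, 3.0, 2.0, 1.0]])
student = np.array([[7.0, 3.0, 2.0, 1.0], [6.0, 4.0, 2.5, 1.5]])
print(distillation_kl(student, teacher))
```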

### Hyperparameter Search

The optimal configuration was found through **30 autonomous experiments** sweeping learning rate, layer selection strategy, batch size, gradient accumulation, weight decay, warmup ratio, number of negatives, training steps, and embedding dimensions. Key findings:

- **Teacher initialization is critical**: Starting from ColBERT-Zero's weights (MRR 0.46) vs raw ModernBERT (MRR 0.08) — a 5.6x improvement
- **Skewed-late layer selection** outperforms evenly-spaced, last-6, and other strategies
- **Effective batch size 32** (bs=8, grad_accum=4) is optimal
- **Weight decay 0.01** provides regularization benefit

## Usage

### Installation

```bash
pip install pylate
```

### Encoding & Retrieval

```python
from pylate import models

# Load model
model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

# Encode documents
doc_embeddings = model.encode(
    ["def hello():\n    print('Hello, World!')", "class UserAuth:\n    ..."],
    batch_size=32,
    is_query=False,
    show_progress_bar=True,
)

# Encode queries
query_embeddings = model.encode(
    ["function that prints a greeting"],
    batch_size=32,
    is_query=True,
    show_progress_bar=True,
)

# Score with MaxSim
from pylate.scores import colbert_scores
scores = colbert_scores(query_embeddings, doc_embeddings)
print(scores)  # Higher = more relevant
```

### Reranking

```python
from pylate import rank, models

model = models.ColBERT(model_name_or_path="ctrltokyo/ColBERT-Zero-6L-CodeSearch")

queries = ["how to authenticate users"]
documents = [["def login(user, pwd): ...", "def sort_list(arr): ...", "class AuthMiddleware: ..."]]
documents_ids = [["doc1", "doc2", "doc3"]]

queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = model.encode(documents, is_query=False)

reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

## GGUF / litembeddings

This model can be converted to GGUF format for use with [litembeddings](https://github.com/alexandernicholson/litembeddings) (SQLite-based embedding engine with SIMD-accelerated MaxSim):

```bash
# Convert to GGUF
python convert_hf_to_gguf.py ctrltokyo/ColBERT-Zero-6L-CodeSearch --outfile model-f16.gguf --outtype f16

# Extract projection
python -c "
from safetensors import safe_open
import numpy as np
f = safe_open('1_Dense/model.safetensors', framework='numpy')
f.get_tensor('linear.weight').astype(np.float32).tofile('model.projection')
"
```

Then in SQL:
```sql
SELECT lembed_model('codesearch', 'model-f16.gguf', '{"colbert_projection": "model.projection"}');
SELECT lembed_maxsim(
    lembed_tokens('search_query: how to sort a list'),
    lembed_tokens('search_document: def quicksort(arr): ...')
);
```

## Limitations

- **Weakest on C code search** (65.9% of teacher on jq corpus) — likely because CodeSearchNet training data is Python-heavy
- **Trained on 10k pairs only** — larger training sets or hard negative mining could improve quality further
- **English only** — inherits ColBERT-Zero's language capabilities
- **No asymmetric prompts** — unlike the teacher, this model does not use `search_query:`/`search_document:` prompts (uses `[Q]`/`[D]` prefixes instead)

## Citation

```bibtex
@misc{colbert-zero-6l-codesearch,
  title={ColBERT-Zero-6L-CodeSearch: A Distilled ColBERT Model for Code Search},
  author={Alexander Nicholson},
  year={2026},
  note={Distilled from ColBERT-Zero (Chaffin et al., 2026) using PyLate on Apple Silicon}
}
```

## Acknowledgments

- [ColBERT-Zero](https://huggingface.co/lightonai/ColBERT-Zero) by LightOn AI — the teacher model
- [PyLate](https://github.com/lightonai/pylate) — ColBERT training framework
- [litembeddings](https://github.com/alexandernicholson/litembeddings) — SQLite embedding engine used for benchmarking
- Training and experimentation performed entirely on Apple Silicon (M4 Max) using PyTorch MPS backend