---
language:
- az
- en
license: apache-2.0
library_name: pytorch
tags:
- colbert
- late-interaction
- retrieval
- information-retrieval
- azerbaijani
- multilingual
base_model: LocalDoc/mmBERT-base-en-az
pipeline_tag: feature-extraction
---
# ColBERT-AZ

A late-interaction retrieval model for Azerbaijani built on top of mmBERT-base-en-az, trained via cross-encoder distillation from bge-reranker-v2-m3 on a mix of native Azerbaijani and translated retrieval data.

ColBERT-AZ uses late interaction (token-level MaxSim scoring) rather than dense single-vector retrieval, yielding higher retrieval precision than bi-encoder models of similar or larger size.
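Concretely, the MaxSim score between a query q and a document d is

score(q, d) = Σ_i max_j (q_i · d_j)

where q_i and d_j are the L2-normalized 128-dimensional token embeddings of the query and the document: each query token is matched against its most similar document token, and those maxima are summed.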
## Model Details
| Property | Value |
|---|---|
| Parameters | 165M |
| Embedding dim | 128 (per token) |
| Backbone | mmBERT-base-en-az (ModernBERT) |
| Architecture | Late interaction (ColBERT) |
| Query max length | 32 tokens |
| Document max length | 256 tokens |
| Languages | Azerbaijani, English |
| Training epochs | 1 |
## Training

### Data
ColBERT-AZ was trained on 3 million triplets sampled from a weighted mix of four reranked datasets:
| Source | Weight | Type |
|---|---|---|
| LocalDoc/msmarco-az-reranked | 50% | Translated web search |
| LocalDoc/azerbaijani_books_retriever_corpus-reranked | 25% | Native literature |
| LocalDoc/azerbaijan_legislation_queries_passages | 15% | Native legal |
| LocalDoc/azerbaijani_retriever_corpus-reranked | 10% | Native general |
All datasets include reranker scores from bge-reranker-v2-m3, used as teacher signal for knowledge distillation.
### Recipe
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.10 |
| Schedule | Cosine |
| Batch size | 16 (effective 32 via gradient accumulation) |
| Negatives per query (K) | 8 |
| False negative filter threshold | 0.9 × pos_score |
| Distillation alpha (KL weight) | 0.7 |
| Contrastive temperature | 0.05 |
| Teacher temperature | 1.0 |
| Mixed precision | BF16 |
| Epochs | 1 |
| Hardware | NVIDIA RTX 5090 (32GB) |
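The false-negative filter drops hard negatives that the teacher scores too close to the positive, since these are often unlabeled positives. A minimal sketch (how dropped negatives are replaced, if at all, is not stated above and is left open here):

```python
def keep_negatives(pos_score: float, neg_scores: list[float],
                   threshold: float = 0.9) -> list[int]:
    """Return indices of hard negatives retained for training.

    A negative whose teacher (bge-reranker-v2-m3) score reaches
    threshold * pos_score is likely an unlabeled positive and is dropped.
    """
    return [i for i, s in enumerate(neg_scores) if s < threshold * pos_score]
```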
### Loss
Combined KL distillation + InfoNCE:
L = α × KL(softmax(student_scores) || softmax(teacher_scores)) + (1 − α) × InfoNCE
where α = 0.7 and student scores are computed via MaxSim over [pos, neg_1, ..., neg_K].
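A minimal PyTorch sketch of this objective; how the two temperatures from the recipe pair with the two terms is an assumption, and the actual training code may differ:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_scores: torch.Tensor,  # [B, 1+K] MaxSim scores, positive at index 0
                 teacher_scores: torch.Tensor,  # [B, 1+K] reranker scores for the same candidates
                 alpha: float = 0.7,
                 tau_contrastive: float = 0.05,
                 tau_teacher: float = 1.0) -> torch.Tensor:
    # KL(softmax(student_scores) || softmax(teacher_scores)), as in the formula above
    p_s = F.softmax(student_scores, dim=-1)
    log_p_s = F.log_softmax(student_scores, dim=-1)
    log_p_t = F.log_softmax(teacher_scores / tau_teacher, dim=-1)
    kl = (p_s * (log_p_s - log_p_t)).sum(dim=-1).mean()

    # InfoNCE: the positive (index 0) against the K hard negatives
    labels = torch.zeros(student_scores.size(0), dtype=torch.long,
                         device=student_scores.device)
    infonce = F.cross_entropy(student_scores / tau_contrastive, labels)

    return alpha * kl + (1.0 - alpha) * infonce
```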
## Evaluation

### Held-out validation
Evaluated on 4,500 held-out triplets (1,500 per native source). Each query is ranked among 1 positive and 8 hard negatives.
| Source | R@1 | R@3 | MRR | NDCG@10 |
|---|---|---|---|---|
| Books | 0.5387 | 0.7693 | 0.6821 | 0.7584 |
| Legislation | 0.6633 | 0.8433 | 0.7679 | 0.8234 |
| Retriever (general) | 0.8340 | 0.9327 | 0.8901 | 0.9167 |
| Macro average | 0.6787 | 0.8484 | 0.7800 | 0.8328 |
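Because each query has exactly one positive, every metric above reduces to a simple function of the positive's rank among the nine candidates; a minimal sketch:

```python
import math

def rank_metrics(scores: list[float]) -> dict[str, float]:
    """Per-query metrics for the 1-positive + 8-negatives protocol.

    `scores` are the model's MaxSim scores with the positive at index 0.
    """
    rank = 1 + sum(s > scores[0] for s in scores[1:])  # rank of the positive (1 = best)
    return {
        "R@1": float(rank <= 1),
        "R@3": float(rank <= 3),
        "MRR": 1.0 / rank,
        # With a single relevant document, NDCG@10 reduces to 1 / log2(rank + 1)
        "NDCG@10": 1.0 / math.log2(rank + 1) if rank <= 10 else 0.0,
    }
```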
### AZ-MIRAGE benchmark

Evaluated on the AZ-MIRAGE retrieval benchmark (7,373 queries over a 40,448-document pool):
| Metric | Score |
|---|---|
| P@1 | 0.3058 |
| R@5 | 0.7518 |
| R@10 | 0.8054 |
| NDCG@5 | 0.5528 |
| NDCG@10 | 0.5704 |
| MRR@10 | 0.4930 |
| F1@10 | 0.1464 |
Comparison with bi-encoder models on AZ-MIRAGE:
| Model | Params | NDCG@10 | MRR@10 | P@1 |
|---|---|---|---|---|
| ColBERT-AZ (this model) | 165M | 0.5704 | 0.4930 | 0.3058 |
| BAAI/bge-m3 | 568M | 0.5079 | 0.4204 | 0.2310 |
| google/gemini-embedding-2-preview | API | 0.5309 | 0.4372 | 0.2338 |
| perplexity/pplx-embed-v1-4b | API | 0.5225 | 0.4361 | 0.2470 |
| microsoft/harrier-oss-v1-0.6b | 600M | 0.5168 | 0.4321 | 0.2535 |
| intfloat/multilingual-e5-large | 560M | 0.4875 | 0.4043 | 0.2264 |
| intfloat/multilingual-e5-base | 278M | 0.4672 | 0.3852 | 0.2116 |
| sentence-transformers/LaBSE | 471M | 0.2472 | 0.1944 | 0.0943 |
## Usage
This repository contains:
- `config.json`, `model.safetensors`, `tokenizer.*` — encoder backbone (mmBERT-base-en-az)
- `projection.pt` — ColBERT linear projection layer (768 → 128, no bias)
ColBERT requires both the backbone and the projection layer for correct inference.
### Loading the model
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModel


class ColBERT(nn.Module):
    def __init__(self, model_name: str, embedding_dim: int = 128):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.backbone.config.hidden_size,
                                    embedding_dim, bias=False)

    @torch.no_grad()
    def encode(self, input_ids, attention_mask, keep_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask,
                            return_dict=True)
        emb = self.projection(out.last_hidden_state)
        emb = F.normalize(emb, p=2, dim=-1)
        # Zero out padded (and optionally filtered) token positions
        eff_mask = attention_mask if keep_mask is None else attention_mask * keep_mask
        emb = emb * eff_mask.unsqueeze(-1).float()
        return emb, eff_mask


# Load
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/colbert-az")

# Add ColBERT special tokens
tokenizer.add_special_tokens({"additional_special_tokens": ["[Q]", "[D]"]})

model = ColBERT("LocalDoc/colbert-az")
model.backbone.resize_token_embeddings(len(tokenizer))

# Load the projection layer
proj_path = hf_hub_download(repo_id="LocalDoc/colbert-az", filename="projection.pt")
model.projection.load_state_dict(torch.load(proj_path, map_location="cpu"))
model = model.to(device).eval()
```
### Encoding queries and documents
```python
# Tokenization helpers
def tokenize_query(text: str, max_len: int = 32):
    text = f"[Q] {text}"
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    # ColBERT trick: replace [PAD] with [MASK] for query expansion
    pad_mask = enc["input_ids"] == tokenizer.pad_token_id
    enc["input_ids"][pad_mask] = tokenizer.mask_token_id
    enc["attention_mask"] = torch.ones_like(enc["input_ids"])
    return enc


def tokenize_doc(text: str, max_len: int = 256):
    text = f"[D] {text}"
    return tokenizer(text, padding=True, truncation=True,
                     max_length=max_len, return_tensors="pt")


# Compute the MaxSim score between a query and a single document
def maxsim_score(query: str, document: str) -> float:
    q_enc = {k: v.to(device) for k, v in tokenize_query(query).items()}
    d_enc = {k: v.to(device) for k, v in tokenize_doc(document).items()}
    q_emb, _ = model.encode(q_enc["input_ids"], q_enc["attention_mask"])
    d_emb, d_mask = model.encode(d_enc["input_ids"], d_enc["attention_mask"])
    # MaxSim: for each query token, take the max similarity over doc tokens, then sum
    sim = torch.einsum("qld,bnd->qlbn", q_emb, d_emb)
    sim = sim.masked_fill(~d_mask.unsqueeze(0).unsqueeze(0).bool(), float("-inf"))
    max_per_token, _ = sim.max(dim=-1)
    return max_per_token.sum(dim=1).item()


# Example
query = "Azərbaycan mədəniyyətinin tarixi"
doc = "Azərbaycan mədəniyyəti zəngin tarixə malikdir və qədim dövrlərdən başlayaraq inkişaf edib."
print(f"Score: {maxsim_score(query, doc):.4f}")
```
### Recommended retrieval pipeline

For production retrieval, pair ColBERT-AZ with an indexing library that supports late interaction, for example the Stanford ColBERT library (PLAID) or RAGatouille. These libraries handle efficient indexing, scalable MaxSim retrieval, and quantization for production deployment.
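If you only need to rerank a small candidate set (e.g., the output of a BM25 or bi-encoder first stage), MaxSim can also be applied directly in batches. A minimal sketch reusing the `model`, `tokenizer`, `tokenize_query`, and `device` defined above:

```python
import torch

def rerank(query: str, candidates: list[str], batch_size: int = 32):
    """Score `candidates` against `query` with MaxSim; returns (score, doc) best-first."""
    q_enc = {k: v.to(device) for k, v in tokenize_query(query).items()}
    q_emb, _ = model.encode(q_enc["input_ids"], q_enc["attention_mask"])  # [1, Lq, 128]
    scores: list[float] = []
    for i in range(0, len(candidates), batch_size):
        batch = [f"[D] {doc}" for doc in candidates[i:i + batch_size]]
        d_enc = tokenizer(batch, padding=True, truncation=True,
                          max_length=256, return_tensors="pt").to(device)
        d_emb, d_mask = model.encode(d_enc["input_ids"], d_enc["attention_mask"])  # [B, Ld, 128]
        sim = torch.einsum("ld,bnd->bln", q_emb[0], d_emb)         # [B, Lq, Ld]
        sim = sim.masked_fill(~d_mask.bool().unsqueeze(1), float("-inf"))
        scores.extend(sim.max(dim=-1).values.sum(dim=1).tolist())  # MaxSim per doc
    return sorted(zip(scores, candidates), reverse=True)
```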
## Citation

```bibtex
@misc{colbert-az-2026,
  title  = {ColBERT-AZ: Late-Interaction Retrieval for Azerbaijani},
  author = {LocalDoc},
  year   = {2026},
  url    = {https://huggingface.co/LocalDoc/colbert-az}
}
```
## License

Apache 2.0
## Acknowledgements

- Built on top of mmBERT-base-en-az
- Distilled from bge-reranker-v2-m3
- Original ColBERT architecture: Khattab & Zaharia (2020), ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- ColBERTv2 distillation methodology: Santhanam et al. (2022), ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction