---
language:
- az
- en
license: apache-2.0
library_name: pytorch
tags:
- colbert
- late-interaction
- retrieval
- information-retrieval
- azerbaijani
- multilingual
base_model: LocalDoc/mmBERT-base-en-az
pipeline_tag: feature-extraction
---

# ColBERT-AZ

A late-interaction retrieval model for Azerbaijani built on top of [mmBERT-base-en-az](https://huggingface.co/LocalDoc/mmBERT-base-en-az). It was trained via cross-encoder distillation from [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) on a mix of native Azerbaijani and translated retrieval data.

ColBERT-AZ uses **late interaction** (token-level MaxSim scoring) rather than dense single-vector retrieval, which yields higher retrieval precision than bi-encoder models of similar or larger size.
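
Concretely, each query token is matched against its most similar document token, and the per-token maxima are summed:

```
score(q, d) = Σ_{i ∈ q} max_{j ∈ d} ⟨E_q[i], E_d[j]⟩
```

where `E_q` and `E_d` are the L2-normalized 128-dim token embeddings of the query and the document.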

## Model Details

| Property | Value |
|----------|-------|
| **Parameters** | 165M |
| **Embedding dim** | 128 (per token) |
| **Backbone** | mmBERT-base-en-az (ModernBERT) |
| **Architecture** | Late interaction (ColBERT) |
| **Query max length** | 32 tokens |
| **Document max length** | 256 tokens |
| **Languages** | Azerbaijani, English |
| **Training epochs** | 1 |

## Training

### Data

ColBERT-AZ was trained on **3 million triplets** sampled from a weighted mix of four reranked datasets:

| Source | Weight | Type |
|--------|--------|------|
| [LocalDoc/msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | 50% | Translated web search |
| [LocalDoc/azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | 25% | Native literature |
| [LocalDoc/azerbaijan_legislation_queries_passages](https://huggingface.co/datasets/LocalDoc/azerbaijan_legislation_queries_passages) | 15% | Native legal |
| [LocalDoc/azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | 10% | Native general |

All four datasets include reranker scores from `bge-reranker-v2-m3`, which serve as the teacher signal for knowledge distillation.
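
For illustration, a weighted mix like this could be drawn with the Hugging Face `datasets` library. This is a hypothetical sketch, not the actual training script; the split names are assumptions:

```python
from datasets import load_dataset, interleave_datasets

sources = [
    "LocalDoc/msmarco-az-reranked",
    "LocalDoc/azerbaijani_books_retriever_corpus-reranked",
    "LocalDoc/azerbaijan_legislation_queries_passages",
    "LocalDoc/azerbaijani_retriever_corpus-reranked",
]

# Interleave the four sources at the 50/25/15/10 ratio used in training
mix = interleave_datasets(
    [load_dataset(s, split="train", streaming=True) for s in sources],
    probabilities=[0.50, 0.25, 0.15, 0.10],
    seed=42,
)
```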

### Recipe

| Hyperparameter | Value |
|----------------|-------|
| Optimizer | AdamW |
| Learning rate | 1e-6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.10 |
| Schedule | Cosine |
| Batch size | 16 (effective 32 via gradient accumulation) |
| Negatives per query (K) | 8 |
| False negative filter threshold | 0.9 × pos_score |
| Distillation alpha (KL weight) | 0.7 |
| Contrastive temperature | 0.05 |
| Teacher temperature | 1.0 |
| Mixed precision | BF16 |
| Epochs | 1 |
| Hardware | NVIDIA RTX 5090 (32GB) |
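
The false negative filter drops sampled negatives that the teacher scores nearly as high as the positive, since these are likely unlabeled positives. A minimal sketch of such a filter, assuming negatives at or above the threshold are simply discarded (the training code may instead re-sample them):

```python
def filter_false_negatives(pos_score: float, neg_scores: list[float],
                           threshold: float = 0.9) -> list[float]:
    # Keep only negatives the teacher scores clearly below the positive
    return [s for s in neg_scores if s < threshold * pos_score]
```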

### Loss

The training objective combines KL distillation with InfoNCE:

```
L = α × KL(softmax(student_scores) || softmax(teacher_scores)) + (1 − α) × InfoNCE
```

where `α = 0.7` and the student scores are computed via MaxSim over `[pos, neg_1, ..., neg_K]`.
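
A minimal PyTorch sketch of one common way to implement this objective; the exact KL direction, reductions, and temperature placement in the actual training code may differ:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_scores: torch.Tensor,  # (batch, 1 + K), positive at index 0
                      teacher_scores: torch.Tensor,  # (batch, 1 + K), reranker scores
                      alpha: float = 0.7,
                      tau_contrastive: float = 0.05,
                      tau_teacher: float = 1.0) -> torch.Tensor:
    # KL term: match the student's score distribution to the teacher's
    kl = F.kl_div(
        F.log_softmax(student_scores, dim=-1),
        F.softmax(teacher_scores / tau_teacher, dim=-1),
        reduction="batchmean",
    )
    # InfoNCE term: cross-entropy with the positive (index 0) as the target
    labels = torch.zeros(student_scores.size(0), dtype=torch.long,
                         device=student_scores.device)
    info_nce = F.cross_entropy(student_scores / tau_contrastive, labels)
    return alpha * kl + (1 - alpha) * info_nce
```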

## Evaluation

### Held-out validation

Evaluated on 4,500 held-out triplets (1,500 per native source). For each query, the model ranks 1 positive among 8 hard negatives.

| Source | R@1 | R@3 | MRR | NDCG@10 |
|--------|-----|-----|-----|---------|
| Books | 0.5387 | 0.7693 | 0.6821 | 0.7584 |
| Legislation | 0.6633 | 0.8433 | 0.7679 | 0.8234 |
| Retriever (general) | 0.8340 | 0.9327 | 0.8901 | 0.9167 |
| **Macro average** | **0.6787** | **0.8484** | **0.7800** | **0.8328** |
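
For reference, with a single relevant passage per query each of these metrics is a simple function of the rank assigned to the positive. A small helper illustrating the definitions (an assumption about the exact evaluation script):

```python
import math

def metrics_from_rank(rank: int) -> dict:
    """rank: 1-based position of the positive among the 9 candidates."""
    return {
        "R@1": float(rank <= 1),
        "R@3": float(rank <= 3),
        "MRR": 1.0 / rank,
        # With one relevant item, IDCG = 1, so NDCG@10 reduces to the DCG term
        "NDCG@10": 1.0 / math.log2(rank + 1) if rank <= 10 else 0.0,
    }
```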

### AZ-MIRAGE benchmark

Evaluated on the [AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE) retrieval benchmark (7,373 queries over a pool of 40,448 documents):

| Metric | Score |
|--------|-------|
| P@1 | 0.3058 |
| R@5 | 0.7518 |
| R@10 | 0.8054 |
| NDCG@5 | 0.5528 |
| NDCG@10 | 0.5704 |
| MRR@10 | 0.4930 |
| F1@10 | 0.1464 |

Comparison with bi-encoder models on AZ-MIRAGE:

| Model | Params | NDCG@10 | MRR@10 | P@1 |
|-------|--------|---------|--------|-----|
| **ColBERT-AZ (this model)** | **165M** | **0.5704** | **0.4930** | **0.3058** |
| BAAI/bge-m3 | 568M | 0.5079 | 0.4204 | 0.2310 |
| google/gemini-embedding-2-preview | API | 0.5309 | 0.4372 | 0.2338 |
| perplexity/pplx-embed-v1-4b | API | 0.5225 | 0.4361 | 0.2470 |
| microsoft/harrier-oss-v1-0.6b | 600M | 0.5168 | 0.4321 | 0.2535 |
| intfloat/multilingual-e5-large | 560M | 0.4875 | 0.4043 | 0.2264 |
| intfloat/multilingual-e5-base | 278M | 0.4672 | 0.3852 | 0.2116 |
| sentence-transformers/LaBSE | 471M | 0.2472 | 0.1944 | 0.0943 |

## Usage

This repository contains:

- `config.json`, `model.safetensors`, `tokenizer.*`: the encoder backbone (mmBERT-base-en-az)
- `projection.pt`: the ColBERT linear projection layer (768 → 128, no bias)

Correct inference requires both the backbone and the projection layer.

### Loading the model

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

class ColBERT(nn.Module):
    def __init__(self, model_name: str, embedding_dim: int = 128):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.backbone.config.hidden_size, embedding_dim, bias=False)

    @torch.no_grad()
    def encode(self, input_ids, attention_mask, keep_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
        emb = self.projection(out.last_hidden_state)
        emb = F.normalize(emb, p=2, dim=-1)
        # Zero out embeddings of padding (and optionally filtered) tokens
        eff_mask = attention_mask if keep_mask is None else attention_mask * keep_mask
        emb = emb * eff_mask.unsqueeze(-1).float()
        return emb, eff_mask

# Load the tokenizer and add the ColBERT query/document marker tokens
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/colbert-az")
tokenizer.add_special_tokens({"additional_special_tokens": ["[Q]", "[D]"]})

model = ColBERT("LocalDoc/colbert-az")
model.backbone.resize_token_embeddings(len(tokenizer))

# Load the projection layer
from huggingface_hub import hf_hub_download

proj_path = hf_hub_download(repo_id="LocalDoc/colbert-az", filename="projection.pt")
model.projection.load_state_dict(torch.load(proj_path, map_location="cpu"))

model = model.to(device).eval()
```
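
As a quick sanity check after loading, the encoder should emit one 128-dimensional vector per token:

```python
enc = tokenizer("[Q] test", return_tensors="pt").to(device)
emb, mask = model.encode(enc["input_ids"], enc["attention_mask"])
print(emb.shape)  # torch.Size([1, seq_len, 128])
```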

### Encoding queries and documents

```python
# Tokenization helpers
def tokenize_query(text: str, max_len: int = 32):
    text = f"[Q] {text}"
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    # ColBERT trick: replace padding with [MASK] tokens for query expansion
    pad_mask = enc["input_ids"] == tokenizer.pad_token_id
    enc["input_ids"][pad_mask] = tokenizer.mask_token_id
    enc["attention_mask"] = torch.ones_like(enc["input_ids"])
    return enc

def tokenize_doc(text: str, max_len: int = 256):
    text = f"[D] {text}"
    return tokenizer(text, padding=True, truncation=True,
                     max_length=max_len, return_tensors="pt")

# Compute the MaxSim score between a query and a single document
def maxsim_score(query: str, document: str) -> float:
    q_enc = {k: v.to(device) for k, v in tokenize_query(query).items()}
    d_enc = {k: v.to(device) for k, v in tokenize_doc(document).items()}

    q_emb, _ = model.encode(q_enc["input_ids"], q_enc["attention_mask"])
    d_emb, d_mask = model.encode(d_enc["input_ids"], d_enc["attention_mask"])

    # MaxSim: for each query token, take the max similarity over doc tokens, then sum
    sim = torch.einsum("qld,bnd->qlbn", q_emb, d_emb)
    sim = sim.masked_fill(~d_mask.unsqueeze(0).unsqueeze(0).bool(), float("-inf"))
    max_per_token, _ = sim.max(dim=-1)
    score = max_per_token.sum(dim=1).item()
    return score

# Example
query = "Azərbaycan mədəniyyətinin tarixi"  # "The history of Azerbaijani culture"
doc = ("Azərbaycan mədəniyyəti zəngin tarixə malikdir və qədim dövrlərdən "
       "başlayaraq inkişaf edib.")  # "Azerbaijani culture has a rich history..."
print(f"Score: {maxsim_score(query, doc):.4f}")
```
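
The same helper can rank a small in-memory corpus by brute force; for anything larger, use the indexing libraries described below:

```python
docs = [doc, "Bakı Azərbaycanın paytaxtıdır."]  # second doc: "Baku is the capital of Azerbaijan."
ranked = sorted(docs, key=lambda d: maxsim_score(query, d), reverse=True)
print(ranked[0])
```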

### Recommended retrieval pipeline

For production retrieval, pair ColBERT-AZ with an indexing library built for late interaction:

- [PLAID](https://github.com/stanford-futuredata/ColBERT): the official ColBERT indexing engine
- [pylate](https://github.com/lightonai/pylate): a modern ColBERT training and retrieval framework

These libraries handle efficient indexing, scalable MaxSim retrieval, and quantization for production deployment.

## Citation

```bibtex
@misc{colbert-az-2026,
  title  = {ColBERT-AZ: Late-Interaction Retrieval for Azerbaijani},
  author = {LocalDoc},
  year   = {2026},
  url    = {https://huggingface.co/LocalDoc/colbert-az}
}
```

## License

Apache 2.0

## Acknowledgements

- Built on top of [mmBERT-base-en-az](https://huggingface.co/LocalDoc/mmBERT-base-en-az)
- Distilled from [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
- Original ColBERT architecture: Khattab & Zaharia (2020), [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832)
- ColBERTv2 distillation methodology: Santhanam et al. (2022), [ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction](https://arxiv.org/abs/2112.01488)