---
language:
- az
- en
license: apache-2.0
library_name: pytorch
tags:
- colbert
- late-interaction
- retrieval
- information-retrieval
- azerbaijani
- multilingual
base_model: LocalDoc/mmBERT-base-en-az
pipeline_tag: feature-extraction
---
# ColBERT-AZ
A late-interaction retrieval model for Azerbaijani built on top of [mmBERT-base-en-az](https://huggingface.co/LocalDoc/mmBERT-base-en-az). Trained via cross-encoder distillation from [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) on a mix of native Azerbaijani and translated retrieval data.
ColBERT-AZ uses **late interaction** (token-level MaxSim scoring) instead of dense single-vector retrieval, which gives it higher retrieval precision than bi-encoder models of similar or larger size.
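Late interaction scores a query–document pair by matching each query token embedding against every document token and summing the per-token maxima (MaxSim):

```
S(q, d) = Σ_i max_j (q_i · d_j)
```

where `q_i` and `d_j` are the L2-normalized 128-dimensional token embeddings of the query and document.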
## Model Details
| Property | Value |
|----------|-------|
| **Parameters** | 165M |
| **Embedding dim** | 128 (per token) |
| **Backbone** | mmBERT-base-en-az (ModernBERT) |
| **Architecture** | Late interaction (ColBERT) |
| **Query max length** | 32 tokens |
| **Document max length** | 256 tokens |
| **Languages** | Azerbaijani, English |
| **Training epochs** | 1 |
## Training
### Data
ColBERT-AZ was trained on **3 million triplets** sampled from a weighted mix of four reranked datasets:
| Source | Weight | Type |
|--------|--------|------|
| [LocalDoc/msmarco-az-reranked](https://huggingface.co/datasets/LocalDoc/msmarco-az-reranked) | 50% | Translated web search |
| [LocalDoc/azerbaijani_books_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_books_retriever_corpus-reranked) | 25% | Native literature |
| [LocalDoc/azerbaijan_legislation_queries_passages](https://huggingface.co/datasets/LocalDoc/azerbaijan_legislation_queries_passages) | 15% | Native legal |
| [LocalDoc/azerbaijani_retriever_corpus-reranked](https://huggingface.co/datasets/LocalDoc/azerbaijani_retriever_corpus-reranked) | 10% | Native general |
All datasets include reranker scores from `bge-reranker-v2-m3`, used as teacher signal for knowledge distillation.
### Recipe
| Hyperparameter | Value |
|----------------|-------|
| Optimizer | AdamW |
| Learning rate | 1e-6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.10 |
| Schedule | Cosine |
| Batch size | 16 (effective 32 via gradient accumulation) |
| Negatives per query (K) | 8 |
| False negative filter threshold | 0.9 × pos_score |
| Distillation alpha (KL weight) | 0.7 |
| Contrastive temperature | 0.05 |
| Teacher temperature | 1.0 |
| Mixed precision | BF16 |
| Epochs | 1 |
| Hardware | NVIDIA RTX 5090 (32GB) |
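The false-negative filter in the recipe above can be sketched as follows. This is a hypothetical helper (not the actual training code): mined negatives that the teacher scores nearly as high as the labeled positive are likely unlabeled positives, so they are dropped.

```python
def filter_negatives(pos_score, neg_scores, ratio=0.9):
    """Drop mined negatives whose teacher score is >= ratio * pos_score,
    since the teacher considers them almost as relevant as the positive."""
    cutoff = ratio * pos_score
    return [s for s in neg_scores if s < cutoff]
```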
### Loss
Combined KL distillation + InfoNCE:
```
L = α × KL(softmax(student_scores) || softmax(teacher_scores)) + (1 − α) × InfoNCE
```
where `α = 0.7` and student scores are computed via MaxSim over [pos, neg_1, ..., neg_K].
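A minimal PyTorch sketch of this objective, assuming score tensors of shape `(batch, K+1)` with the positive in column 0. This is an illustration, not the repository's training code; note that `F.kl_div` takes the student's log-probabilities as input and the teacher's probabilities as the target.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_scores, teacher_scores, alpha=0.7, tau=0.05, teacher_temp=1.0):
    """Combined KL distillation + InfoNCE over [pos, neg_1, ..., neg_K] scores."""
    # KL term: match the student's score distribution to the teacher's.
    student_logp = F.log_softmax(student_scores, dim=-1)
    teacher_p = F.softmax(teacher_scores / teacher_temp, dim=-1)
    kl = F.kl_div(student_logp, teacher_p, reduction="batchmean")
    # InfoNCE term: the positive (index 0) should win a temperature-scaled softmax.
    labels = torch.zeros(student_scores.size(0), dtype=torch.long)
    info_nce = F.cross_entropy(student_scores / tau, labels)
    return alpha * kl + (1.0 - alpha) * info_nce
```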
## Evaluation
### Held-out validation
Evaluated on 4,500 held-out triplets (1,500 per native source). For each query, the positive passage is ranked against 8 hard negatives (9 candidates in total).
| Source | R@1 | R@3 | MRR | NDCG@10 |
|--------|-----|-----|-----|---------|
| Books | 0.5387 | 0.7693 | 0.6821 | 0.7584 |
| Legislation | 0.6633 | 0.8433 | 0.7679 | 0.8234 |
| Retriever (general) | 0.8340 | 0.9327 | 0.8901 | 0.9167 |
| **Macro average** | **0.6787** | **0.8484** | **0.7800** | **0.8328** |
### AZ-MIRAGE benchmark
Evaluated on the [AZ-MIRAGE](https://github.com/LocalDoc-Azerbaijan/AZ-MIRAGE) retrieval benchmark (7,373 queries against a pool of 40,448 documents):
| Metric | Score |
|--------|-------|
| P@1 | 0.3058 |
| R@5 | 0.7518 |
| R@10 | 0.8054 |
| NDCG@5 | 0.5528 |
| NDCG@10 | 0.5704 |
| MRR@10 | 0.4930 |
| F1@10 | 0.1464 |
Comparison with bi-encoder models on AZ-MIRAGE:
| Model | Params | NDCG@10 | MRR@10 | P@1 |
|-------|--------|---------|--------|-----|
| **ColBERT-AZ (this model)** | **165M** | **0.5704** | **0.4930** | **0.3058** |
| BAAI/bge-m3 | 568M | 0.5079 | 0.4204 | 0.2310 |
| google/gemini-embedding-2-preview | API | 0.5309 | 0.4372 | 0.2338 |
| perplexity/pplx-embed-v1-4b | API | 0.5225 | 0.4361 | 0.2470 |
| microsoft/harrier-oss-v1-0.6b | 600M | 0.5168 | 0.4321 | 0.2535 |
| intfloat/multilingual-e5-large | 560M | 0.4875 | 0.4043 | 0.2264 |
| intfloat/multilingual-e5-base | 278M | 0.4672 | 0.3852 | 0.2116 |
| sentence-transformers/LaBSE | 471M | 0.2472 | 0.1944 | 0.0943 |
## Usage
This repository contains:
- `config.json`, `model.safetensors`, `tokenizer.*` — encoder backbone (mmBERT-base-en-az)
- `projection.pt` — ColBERT linear projection layer (768 → 128, no bias)
ColBERT requires both the backbone and the projection layer for correct inference.
### Loading the model
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
class ColBERT(nn.Module):
def __init__(self, model_name: str, embedding_dim: int = 128):
super().__init__()
self.backbone = AutoModel.from_pretrained(model_name)
self.projection = nn.Linear(self.backbone.config.hidden_size, embedding_dim, bias=False)
@torch.no_grad()
def encode(self, input_ids, attention_mask, keep_mask=None):
out = self.backbone(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
emb = self.projection(out.last_hidden_state)
emb = F.normalize(emb, p=2, dim=-1)
eff_mask = attention_mask if keep_mask is None else attention_mask * keep_mask
emb = emb * eff_mask.unsqueeze(-1).float()
return emb, eff_mask
# Load
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/colbert-az")
# Add ColBERT special tokens
tokenizer.add_special_tokens({"additional_special_tokens": ["[Q]", "[D]"]})
model = ColBERT("LocalDoc/colbert-az")
model.backbone.resize_token_embeddings(len(tokenizer))
# Load projection layer
from huggingface_hub import hf_hub_download
proj_path = hf_hub_download(repo_id="LocalDoc/colbert-az", filename="projection.pt")
model.projection.load_state_dict(torch.load(proj_path, map_location="cpu"))
model = model.to(device).eval()
```
### Encoding queries and documents
```python
# Tokenization helpers
def tokenize_query(text: str, max_len: int = 32):
text = f"[Q] {text}"
enc = tokenizer(text, padding="max_length", truncation=True,
max_length=max_len, return_tensors="pt")
# ColBERT trick: replace pad with mask for query expansion
pad_mask = enc["input_ids"] == tokenizer.pad_token_id
enc["input_ids"][pad_mask] = tokenizer.mask_token_id
enc["attention_mask"] = torch.ones_like(enc["input_ids"])
return enc
def tokenize_doc(text: str, max_len: int = 256):
text = f"[D] {text}"
return tokenizer(text, padding=True, truncation=True,
max_length=max_len, return_tensors="pt")
# Compute MaxSim score between query and a single document
def maxsim_score(query: str, document: str) -> float:
q_enc = {k: v.to(device) for k, v in tokenize_query(query).items()}
d_enc = {k: v.to(device) for k, v in tokenize_doc(document).items()}
q_emb, _ = model.encode(q_enc["input_ids"], q_enc["attention_mask"])
d_emb, d_mask = model.encode(d_enc["input_ids"], d_enc["attention_mask"])
# MaxSim: for each query token, take max similarity over doc tokens, then sum
sim = torch.einsum("qld,bnd->qlbn", q_emb, d_emb)
sim = sim.masked_fill(~d_mask.unsqueeze(0).unsqueeze(0).bool(), float("-inf"))
max_per_token, _ = sim.max(dim=-1)
score = max_per_token.sum(dim=1).item()
return score
# Example (query: "The history of Azerbaijani culture")
query = "Azərbaycan mədəniyyətinin tarixi"
# Document: "Azerbaijani culture has a rich history and has developed since ancient times."
doc = "Azərbaycan mədəniyyəti zəngin tarixə malikdir və qədim dövrlərdən başlayaraq inkişaf edib."
print(f"Score: {maxsim_score(query, doc):.4f}")
```
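To rank a candidate set rather than score one pair at a time, the same MaxSim operation can be batched over precomputed document token embeddings. A minimal sketch under assumed shapes (this helper is not part of the repository):

```python
import torch

def maxsim_rank(q_emb, doc_embs, doc_masks):
    """Rank documents for one query with MaxSim over precomputed embeddings.

    q_emb:     (Lq, d)    L2-normalized query token embeddings
    doc_embs:  (N, Ld, d) L2-normalized, zero-padded document token embeddings
    doc_masks: (N, Ld)    1 for real tokens, 0 for padding
    """
    # Similarity of every query token to every token of every document.
    sim = torch.einsum("ld,nmd->nlm", q_emb, doc_embs)            # (N, Lq, Ld)
    # Exclude padded document positions from the max.
    sim = sim.masked_fill(~doc_masks.bool().unsqueeze(1), float("-inf"))
    # MaxSim: best doc token per query token, summed over query tokens.
    scores = sim.max(dim=-1).values.sum(dim=-1)                   # (N,)
    return scores.argsort(descending=True), scores
```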
### Recommended retrieval pipeline
For production retrieval, pair ColBERT-AZ with an indexing library that supports late interaction:
- [PLAID](https://github.com/stanford-futuredata/ColBERT) — official ColBERT indexing
- [pylate](https://github.com/lightonai/pylate) — modern ColBERT framework
These libraries handle efficient indexing, scalable MaxSim retrieval, and quantization for production deployment.
## Citation
```bibtex
@misc{colbert-az-2026,
title = {ColBERT-AZ: Late-Interaction Retrieval for Azerbaijani},
author = {LocalDoc},
year = {2026},
url = {https://huggingface.co/LocalDoc/colbert-az}
}
```
## License
Apache 2.0
## Acknowledgements
- Built on top of [mmBERT-base-en-az](https://huggingface.co/LocalDoc/mmBERT-base-en-az)
- Distilled from [bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3)
- Original ColBERT architecture: Khattab & Zaharia (2020), [ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT](https://arxiv.org/abs/2004.12832)
- ColBERTv2 distillation methodology: Santhanam et al. (2022), [ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction](https://arxiv.org/abs/2112.01488)