---
language:
- az
- en
license: apache-2.0
library_name: pytorch
tags:
- colbert
- late-interaction
- retrieval
- information-retrieval
- azerbaijani
- multilingual
base_model: LocalDoc/mmBERT-base-en-az
pipeline_tag: feature-extraction
---
# ColBERT-AZ

A late-interaction retrieval model for Azerbaijani built on top of mmBERT-base-en-az, trained via cross-encoder distillation from bge-reranker-v2-m3 on a mix of native Azerbaijani and translated retrieval data.

ColBERT-AZ uses late interaction (token-level MaxSim scoring) rather than dense single-vector retrieval, yielding higher retrieval precision than bi-encoder models of similar or larger size.
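Concretely, the MaxSim score between a query q and a document d is

score(q, d) = Σ_i max_j (q_i · d_j)

where q_i and d_j are the L2-normalized 128-dimensional token embeddings of the query and the document: each query token is matched against its most similar document token, and those maxima are summed.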
## Model Details
| Property | Value |
|---|---|
| Parameters | 165M |
| Embedding dim | 128 (per token) |
| Backbone | mmBERT-base-en-az (ModernBERT) |
| Architecture | Late interaction (ColBERT) |
| Query max length | 32 tokens |
| Document max length | 256 tokens |
| Languages | Azerbaijani, English |
| Training epochs | 1 |
## Training

### Data
ColBERT-AZ was trained on 3 million triplets sampled from a weighted mix of four reranked datasets:
| Source | Weight | Type |
|---|---|---|
| LocalDoc/msmarco-az-reranked | 50% | Translated web search |
| LocalDoc/azerbaijani_books_retriever_corpus-reranked | 25% | Native literature |
| LocalDoc/azerbaijan_legislation_queries_passages | 15% | Native legal |
| LocalDoc/azerbaijani_retriever_corpus-reranked | 10% | Native general |
All datasets include reranker scores from bge-reranker-v2-m3, used as teacher signal for knowledge distillation.
### Recipe
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.10 |
| Schedule | Cosine |
| Batch size | 16 (effective 32 via gradient accumulation) |
| Negatives per query (K) | 8 |
| False negative filter threshold | 0.9 × pos_score |
| Distillation alpha (KL weight) | 0.7 |
| Contrastive temperature | 0.05 |
| Teacher temperature | 1.0 |
| Mixed precision | BF16 |
| Epochs | 1 |
| Hardware | NVIDIA RTX 5090 (32GB) |
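The false-negative filter drops hard negatives that the teacher scores too close to the positive, since these are often unlabeled positives. A minimal sketch (how dropped negatives are replaced, if at all, is not stated above and is left open here):

```python
def keep_negatives(pos_score: float, neg_scores: list[float],
                   threshold: float = 0.9) -> list[int]:
    """Return indices of hard negatives retained for training.

    A negative whose teacher (bge-reranker-v2-m3) score reaches
    threshold * pos_score is likely an unlabeled positive and is dropped.
    """
    return [i for i, s in enumerate(neg_scores) if s < threshold * pos_score]
```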
### Loss
Combined KL distillation + InfoNCE:
L = α × KL(softmax(student_scores) || softmax(teacher_scores)) + (1 − α) × InfoNCE
where α = 0.7 and student scores are computed via MaxSim over [pos, neg_1, ..., neg_K].
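A minimal PyTorch sketch of this objective; how the two temperatures from the recipe pair with the two terms is an assumption, and the actual training code may differ:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_scores: torch.Tensor,  # [B, 1+K] MaxSim scores, positive at index 0
                 teacher_scores: torch.Tensor,  # [B, 1+K] reranker scores for the same candidates
                 alpha: float = 0.7,
                 tau_contrastive: float = 0.05,
                 tau_teacher: float = 1.0) -> torch.Tensor:
    # KL(softmax(student_scores) || softmax(teacher_scores)), as in the formula above
    p_s = F.softmax(student_scores, dim=-1)
    log_p_s = F.log_softmax(student_scores, dim=-1)
    log_p_t = F.log_softmax(teacher_scores / tau_teacher, dim=-1)
    kl = (p_s * (log_p_s - log_p_t)).sum(dim=-1).mean()

    # InfoNCE: the positive (index 0) against the K hard negatives
    labels = torch.zeros(student_scores.size(0), dtype=torch.long,
                         device=student_scores.device)
    infonce = F.cross_entropy(student_scores / tau_contrastive, labels)

    return alpha * kl + (1.0 - alpha) * infonce
```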
## Evaluation

### Held-out validation
Evaluated on 4,500 held-out triplets (1,500 per native source). Each query is ranked among 1 positive and 8 hard negatives.
| Source | R@1 | R@3 | MRR | NDCG@10 |
|---|---|---|---|---|
| Books | 0.5387 | 0.7693 | 0.6821 | 0.7584 |
| Legislation | 0.6633 | 0.8433 | 0.7679 | 0.8234 |
| Retriever (general) | 0.8340 | 0.9327 | 0.8901 | 0.9167 |
| Macro average | 0.6787 | 0.8484 | 0.7800 | 0.8328 |
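Because each query has exactly one positive, every metric above reduces to a simple function of the positive's rank among the nine candidates; a minimal sketch:

```python
import math

def rank_metrics(scores: list[float]) -> dict[str, float]:
    """Per-query metrics for the 1-positive + 8-negatives protocol.

    `scores` are the model's MaxSim scores with the positive at index 0.
    """
    rank = 1 + sum(s > scores[0] for s in scores[1:])  # rank of the positive (1 = best)
    return {
        "R@1": float(rank <= 1),
        "R@3": float(rank <= 3),
        "MRR": 1.0 / rank,
        # With a single relevant document, NDCG@10 reduces to 1 / log2(rank + 1)
        "NDCG@10": 1.0 / math.log2(rank + 1) if rank <= 10 else 0.0,
    }
```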
### AZ-MIRAGE benchmark

Evaluated on the AZ-MIRAGE retrieval benchmark (7,373 queries over a 40,448-document pool):
| Metric | Score |
|---|---|
| P@1 | 0.3058 |
| R@5 | 0.7518 |
| R@10 | 0.8054 |
| NDCG@5 | 0.5528 |
| NDCG@10 | 0.5704 |
| MRR@10 | 0.4930 |
| F1@10 | 0.1464 |
Comparison with bi-encoder models on AZ-MIRAGE:
| Model | Params | NDCG@10 | MRR@10 | P@1 |
|---|---|---|---|---|
| ColBERT-AZ (this model) | 165M | 0.5704 | 0.4930 | 0.3058 |
| BAAI/bge-m3 | 568M | 0.5079 | 0.4204 | 0.2310 |
| google/gemini-embedding-2-preview | API | 0.5309 | 0.4372 | 0.2338 |
| perplexity/pplx-embed-v1-4b | API | 0.5225 | 0.4361 | 0.2470 |
| microsoft/harrier-oss-v1-0.6b | 600M | 0.5168 | 0.4321 | 0.2535 |
| intfloat/multilingual-e5-large | 560M | 0.4875 | 0.4043 | 0.2264 |
| intfloat/multilingual-e5-base | 278M | 0.4672 | 0.3852 | 0.2116 |
| sentence-transformers/LaBSE | 471M | 0.2472 | 0.1944 | 0.0943 |
## Usage
This repository contains:
- `config.json`, `model.safetensors`, `tokenizer.*` — encoder backbone (mmBERT-base-en-az)
- `projection.pt` — ColBERT linear projection layer (768 → 128, no bias)
ColBERT requires both the backbone and the projection layer for correct inference.
### Loading the model
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModel


class ColBERT(nn.Module):
    def __init__(self, model_name: str, embedding_dim: int = 128):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.backbone.config.hidden_size,
                                    embedding_dim, bias=False)

    @torch.no_grad()
    def encode(self, input_ids, attention_mask, keep_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask,
                            return_dict=True)
        emb = self.projection(out.last_hidden_state)
        emb = F.normalize(emb, p=2, dim=-1)
        # Zero out padded (and optionally filtered) token positions
        eff_mask = attention_mask if keep_mask is None else attention_mask * keep_mask
        emb = emb * eff_mask.unsqueeze(-1).float()
        return emb, eff_mask


# Load
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/colbert-az")

# Add ColBERT special tokens
tokenizer.add_special_tokens({"additional_special_tokens": ["[Q]", "[D]"]})

model = ColBERT("LocalDoc/colbert-az")
model.backbone.resize_token_embeddings(len(tokenizer))

# Load the projection layer
proj_path = hf_hub_download(repo_id="LocalDoc/colbert-az", filename="projection.pt")
model.projection.load_state_dict(torch.load(proj_path, map_location="cpu"))
model = model.to(device).eval()
```
### Encoding queries and documents
```python
# Tokenization helpers
def tokenize_query(text: str, max_len: int = 32):
    text = f"[Q] {text}"
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    # ColBERT trick: replace [PAD] with [MASK] for query expansion
    pad_mask = enc["input_ids"] == tokenizer.pad_token_id
    enc["input_ids"][pad_mask] = tokenizer.mask_token_id
    enc["attention_mask"] = torch.ones_like(enc["input_ids"])
    return enc


def tokenize_doc(text: str, max_len: int = 256):
    text = f"[D] {text}"
    return tokenizer(text, padding=True, truncation=True,
                     max_length=max_len, return_tensors="pt")


# Compute the MaxSim score between a query and a single document
def maxsim_score(query: str, document: str) -> float:
    q_enc = {k: v.to(device) for k, v in tokenize_query(query).items()}
    d_enc = {k: v.to(device) for k, v in tokenize_doc(document).items()}
    q_emb, _ = model.encode(q_enc["input_ids"], q_enc["attention_mask"])
    d_emb, d_mask = model.encode(d_enc["input_ids"], d_enc["attention_mask"])
    # MaxSim: for each query token, take the max similarity over doc tokens, then sum
    sim = torch.einsum("qld,bnd->qlbn", q_emb, d_emb)
    sim = sim.masked_fill(~d_mask.unsqueeze(0).unsqueeze(0).bool(), float("-inf"))
    max_per_token, _ = sim.max(dim=-1)
    return max_per_token.sum(dim=1).item()


# Example
query = "Azərbaycan mədəniyyətinin tarixi"
doc = "Azərbaycan mədəniyyəti zəngin tarixə malikdir və qədim dövrlərdən başlayaraq inkişaf edib."
print(f"Score: {maxsim_score(query, doc):.4f}")
```
### Recommended retrieval pipeline

For production retrieval, pair ColBERT-AZ with an indexing library that supports late interaction, for example the Stanford ColBERT library (PLAID) or RAGatouille. These libraries handle efficient indexing, scalable MaxSim retrieval, and quantization for production deployment.
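If you only need to rerank a small candidate set (e.g., the output of a BM25 or bi-encoder first stage), MaxSim can also be applied directly in batches. A minimal sketch reusing the `model`, `tokenizer`, `tokenize_query`, and `device` defined above:

```python
import torch

def rerank(query: str, candidates: list[str], batch_size: int = 32):
    """Score `candidates` against `query` with MaxSim; returns (score, doc) best-first."""
    q_enc = {k: v.to(device) for k, v in tokenize_query(query).items()}
    q_emb, _ = model.encode(q_enc["input_ids"], q_enc["attention_mask"])  # [1, Lq, 128]
    scores: list[float] = []
    for i in range(0, len(candidates), batch_size):
        batch = [f"[D] {doc}" for doc in candidates[i:i + batch_size]]
        d_enc = tokenizer(batch, padding=True, truncation=True,
                          max_length=256, return_tensors="pt").to(device)
        d_emb, d_mask = model.encode(d_enc["input_ids"], d_enc["attention_mask"])  # [B, Ld, 128]
        sim = torch.einsum("ld,bnd->bln", q_emb[0], d_emb)         # [B, Lq, Ld]
        sim = sim.masked_fill(~d_mask.bool().unsqueeze(1), float("-inf"))
        scores.extend(sim.max(dim=-1).values.sum(dim=1).tolist())  # MaxSim per doc
    return sorted(zip(scores, candidates), reverse=True)
```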
## Citation

```bibtex
@misc{colbert-az-2026,
  title  = {ColBERT-AZ: Late-Interaction Retrieval for Azerbaijani},
  author = {LocalDoc},
  year   = {2026},
  url    = {https://huggingface.co/LocalDoc/colbert-az}
}
```
## License

Apache 2.0
## Acknowledgements

- Built on top of mmBERT-base-en-az
- Distilled from bge-reranker-v2-m3
- Original ColBERT architecture: Khattab & Zaharia (2020), ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- ColBERTv2 distillation methodology: Santhanam et al. (2022), ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction