# mmBERT-Embed-32K-2D-Matryoshka
A multilingual embedding model with a 32K context window and 2D Matryoshka support for flexible quality-efficiency tradeoffs.
## Model Highlights
| Feature | Value |
|---|---|
| Parameters | 307M |
| Context Length | 32,768 tokens |
| Languages | 1800+ (via Glot500) |
| Embedding Dim | 768 (supports 64-768 via Matryoshka) |
| Architecture | ModernBERT encoder with YaRN scaling |
## Key Results
| Metric | Score |
|---|---|
| MTEB Mean (24 tasks) | 61.4 |
| STS Benchmark | 80.5 (exceeds Qwen3-0.6B's 76.17) |
| Dimension Retention | 99% @ 256d, 98% @ 64d |
| Layer Speedup | 3.3× @ 6L, 5.8× @ 3L |
| Latency vs BGE-M3 | 1.6-3.1× faster at 2K-4K tokens (FA2 advantage) |
## What is 2D Matryoshka?
This model supports two dimensions of flexibility:
- Dimension Reduction (Matryoshka): Truncate embeddings to smaller dimensions with minimal quality loss
- Layer Reduction (Adaptive): Use intermediate layer outputs for faster inference
| Config (layers, dims) | Quality (relative STS) | Speedup | Storage |
|---|---|---|---|
| 22L, 768d | 100% | 1.0× | 100% |
| 22L, 256d | 99% | 1.0× | 33% |
| 22L, 64d | 98% | 1.0× | 8% |
| 6L, 768d | 56% | 3.3× | 100% |
| 6L, 256d | 56% | 3.3× | 33% |
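As a back-of-the-envelope check on the storage column, assuming float32 vectors (4 bytes per value; a quantized index would shrink all figures proportionally):

```python
# Storage for 1M embeddings at each Matryoshka dimension (float32 assumed).
n_vectors = 1_000_000
for dim in (768, 256, 64):
    gb = n_vectors * dim * 4 / 1e9
    print(f"{dim}d: {gb:.2f} GB ({dim / 768:.0%} of full size)")
# 768d: 3.07 GB (100%), 256d: 1.02 GB (33%), 64d: 0.26 GB (8%)
```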
## Usage
### Basic Usage (Sentence Transformers)
```python
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode sentences
sentences = [
    "This is a test sentence.",
    "这是一个测试句子。",
    "Dies ist ein Testsatz.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
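Cosine similarities between these embeddings can be computed with the `util` helpers that ship with `sentence-transformers`; a small follow-up to the snippet above (the three test sentences are translations of each other, so the cross-lingual pairs should score high):

```python
from sentence_transformers import util

# Pairwise cosine similarity between the three sentences encoded above.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # 3x3 matrix; off-diagonal entries compare across languages
```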
### Matryoshka Dimension Reduction
```python
import torch.nn.functional as F

# Encode with full dimensions
embeddings = model.encode(sentences, convert_to_tensor=True)

# Truncate to a smaller dimension (e.g., 256)
embeddings_256d = embeddings[:, :256]
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)

# Or truncate to 64 dimensions for maximum compression
embeddings_64d = embeddings[:, :64]
embeddings_64d = F.normalize(embeddings_64d, p=2, dim=1)
```
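Recent `sentence-transformers` releases (2.7.0+) can also apply the truncation for you via the `truncate_dim` argument; a sketch, assuming that argument is available in your installed version:

```python
# Ask the model to return 256-dimensional embeddings directly.
model_256 = SentenceTransformer(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    truncate_dim=256,
)
embeddings_256d = model_256.encode(sentences, convert_to_tensor=True)
# Re-normalize after truncation before using dot-product similarity.
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)
```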
### Long Context (up to 32K tokens)
```python
# For long documents, set max_seq_length
model.max_seq_length = 8192  # or up to 32768

long_document = "..." * 10000  # Very long text
embedding = model.encode(long_document)
```
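A minimal long-document retrieval sketch on top of the snippet above (the corpus, query, and printed score are placeholders, not benchmark data):

```python
from sentence_transformers import util

# Hypothetical corpus of long documents (each may span thousands of tokens).
documents = ["First long report ...", "Second long report ...", "Third long report ..."]
query = "What were the key findings of the second report?"

model.max_seq_length = 32768  # use the full context window
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(f"Best match: document {best} (score {scores[best]:.3f})")
```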
### Layer Reduction (Advanced)
For latency-critical applications, you can extract embeddings from intermediate layers:
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    output_hidden_states=True,
)
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use layer 6 for 3.3× speedup (56% quality)
hidden = outputs.hidden_states[6]
hidden = model.final_norm(hidden)

# Mean pooling
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
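The two axes compose: the pooled layer-6 embedding above can also be truncated to a smaller Matryoshka dimension, giving configurations such as 6L/256d from the table in the 2D Matryoshka section:

```python
# Combine layer reduction with dimension reduction (6L, 256d).
embeddings_6l_256d = F.normalize(embeddings[:, :256], p=2, dim=1)
```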
## Evaluation Results
### MTEB Benchmark (24 tasks)
| Category | Score |
|---|---|
| STS (7 tasks) | 79.3 |
| Classification (6 tasks) | 62.4 |
| Pair Classification (2 tasks) | 76.2 |
| Reranking (2 tasks) | 64.4 |
| Clustering (4 tasks) | 36.9 |
| Retrieval (3 tasks) | 38.2 |
| Overall Mean | 61.4 |
### STS Benchmark
| Model | Parameters | STS Score |
|---|---|---|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| mmBERT-Embed | 307M | 80.5 |
| Qwen3-Embed-8B | 8B | 81.08 |
### 2D Matryoshka Quality Matrix (STS)
| Layers | 768d | 256d | 64d |
|---|---|---|---|
| 22L | 80.5 | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |
### Long-Context Retrieval (4K tokens)
| Metric | Score |
|---|---|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |
### Throughput (AMD MI300X)
| Layers | Throughput (sequences/s) | Speedup |
|---|---|---|
| 22L | 477 | 1.0× |
| 11L | 916 | 1.9× |
| 6L | 1573 | 3.3× |
| 3L | 2761 | 5.8× |
### Latency Comparison vs BGE-M3 and Qwen3-Embedding-0.6B
mmBERT-Embed is significantly faster at longer sequences due to:
- Flash Attention 2: BGE-M3 does not use FA2, so it materializes the full O(n²) attention matrix, while FA2 keeps attention memory linear in sequence length (see the loading sketch after this list)
- Encoder architecture: Qwen3-Embedding is a decoder-based model with causal masking
- Smaller model: 307M parameters vs 569M (BGE-M3) and 600M (Qwen3-0.6B)
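When loading the backbone directly through `transformers`, FA2 and bf16 (the precision used in the benchmarks below) can be requested explicitly; a sketch, assuming `flash-attn` is installed and the custom model code honors the setting:

```python
import torch
from transformers import AutoModel

# Load with Flash Attention 2 and bf16, matching the benchmarked setup.
model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```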
#### Batch Size = 1
| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 17.6ms (57/s) | 20.7ms (48/s) | 10.8ms (93/s) | 0.6× |
| 1024 | 18.6ms (54/s) | 21.2ms (47/s) | 16.3ms (61/s) | 0.9× |
| 2048 | 19.5ms (51/s) | 24.1ms (42/s) | 31.1ms (32/s) | 1.6× |
| 4096 | 21.3ms (47/s) | 43.5ms (23/s) | 60.5ms (17/s) | 2.8× |
#### Batch Size = 8
| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 21.1ms (379/s) | 33.0ms (243/s) | 40.0ms (200/s) | 1.9× |
| 1024 | 34.5ms (232/s) | 58.5ms (137/s) | 77.4ms (103/s) | 2.2× |
| 2048 | 65.2ms (123/s) | 117.0ms (68/s) | 162.9ms (49/s) | 2.5× |
| 4096 | 130.7ms (61/s) | 254.9ms (31/s) | 411.3ms (19/s) | 3.1× |
Key insight: the FA2 advantage grows with sequence length and batch size:
- At short sequences (512 tokens) with batch size 1, BGE-M3 is faster; FA2 brings little benefit while attention is a small share of the work
- From roughly 2K tokens onward, mmBERT-Embed pulls clearly ahead
- At 4K tokens with batch size 8, mmBERT-Embed is 3.1× faster than BGE-M3
Benchmarked on AMD MI300X, bf16 precision.
## Training
### Data
Trained on BAAI/bge-m3-data (73GB, 279 JSONL files) with:
- Multilingual triplets (query, positive, negative)
- Diverse domains and languages
### Configuration
- Base Model: llm-semantic-router/mmbert-32k-yarn
- Loss: Matryoshka2dLoss (combines AdaptiveLayerLoss + MatryoshkaLoss)
- Matryoshka Dimensions: [768, 512, 256, 128, 64]
- Epochs: 1
- Batch Size: 16 (effective 32 with gradient accumulation)
- Learning Rate: 2e-5
- Max Sequence Length: 32,768
- Hardware: AMD Instinct MI300X
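A hedged sketch of the loss wiring described above, using the `Matryoshka2dLoss` wrapper from `sentence-transformers` (data loading, the trainer loop, and exact hyperparameter plumbing are omitted and may differ from the actual training script):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import Matryoshka2dLoss, MultipleNegativesRankingLoss

# Start from the long-context base model.
model = SentenceTransformer("llm-semantic-router/mmbert-32k-yarn")

# Contrastive loss over (query, positive, negative) triplets, wrapped so that
# every Matryoshka dimension and sampled intermediate layers are supervised.
base_loss = MultipleNegativesRankingLoss(model)
loss = Matryoshka2dLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)
```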
## Use Cases
### When to Use mmBERT-Embed
- Multilingual RAG for 1800+ languages (especially low-resource languages not covered by Qwen3 or BGE-M3)
- Long-document retrieval where chunking loses cross-section relationships
- Edge deployment where 307M parameters matter versus 600M+
- Flexible inference where you need to trade quality for speed/storage at runtime
### When to Use Alternatives
- Maximum quality on major languages: Qwen3-Embed-8B
- Production stability: BGE-M3 (more battle-tested)
- Very short texts only: Smaller models may suffice
## Limitations
- Layer reduction quality (56% at 6L) is lower than full model; use for latency-critical applications where moderate quality is acceptable
- MTEB mean (61.4) is slightly below BGE-M3 (64.5) but with 4× longer context and 2D flexibility
- Optimized for retrieval tasks; may need fine-tuning for other downstream tasks
## Citation
```bibtex
@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}
```
## License
Apache 2.0