# mmBERT-Embed-32K-2D-Matryoshka
A multilingual embedding model with a 32K context window and 2D Matryoshka support for flexible quality-efficiency tradeoffs.
## Model Highlights
| Feature | Value |
|---|---|
| Parameters | 307M |
| Context Length | 32,768 tokens |
| Languages | 1800+ (via Glot500) |
| Embedding Dim | 768 (supports 64-768 via Matryoshka) |
| Architecture | ModernBERT encoder with YaRN scaling |
## Key Results
| Metric | Score |
|---|---|
| MTEB Mean (24 tasks) | 61.4 |
| STS Benchmark | 80.5 (exceeds Qwen3-0.6B's 76.17) |
| Dimension Retention | 99% @ 256d, 98% @ 64d |
| Layer Speedup | 3.3× @ 6L, 5.8× @ 3L |
| Latency vs BGE-M3 | 1.6-3.1× faster at 2K-4K tokens (FA2 advantage) |
## What is 2D Matryoshka?
This model supports two dimensions of flexibility:
- Dimension Reduction (Matryoshka): Truncate embeddings to smaller dimensions with minimal quality loss
- Layer Reduction (Adaptive): Use intermediate layer outputs for faster inference
| Config (layers, dims) | Quality (relative STS) | Speedup | Storage |
|---|---|---|---|
| 22L, 768d | 100% | 1.0× | 100% |
| 22L, 256d | 99% | 1.0× | 33% |
| 22L, 64d | 98% | 1.0× | 8% |
| 6L, 768d | 56% | 3.3× | 100% |
| 6L, 256d | 56% | 3.3× | 33% |
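As a back-of-the-envelope check on the storage column, assuming float32 vectors (4 bytes per value; a quantized index would shrink all figures proportionally):

```python
# Storage for 1M embeddings at each Matryoshka dimension (float32 assumed).
n_vectors = 1_000_000
for dim in (768, 256, 64):
    gb = n_vectors * dim * 4 / 1e9
    print(f"{dim}d: {gb:.2f} GB ({dim / 768:.0%} of full size)")
# 768d: 3.07 GB (100%), 256d: 1.02 GB (33%), 64d: 0.26 GB (8%)
```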
## Usage
### Basic Usage (Sentence Transformers)
```python
from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode sentences
sentences = [
    "This is a test sentence.",
    "这是一个测试句子。",
    "Dies ist ein Testsatz.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
```
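Cosine similarities between these embeddings can be computed with the `util` helpers that ship with `sentence-transformers`; a small follow-up to the snippet above (the three test sentences are translations of each other, so the cross-lingual pairs should score high):

```python
from sentence_transformers import util

# Pairwise cosine similarity between the three sentences encoded above.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # 3x3 matrix; off-diagonal entries compare across languages
```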
### Matryoshka Dimension Reduction
```python
import torch.nn.functional as F

# Encode with full dimensions
embeddings = model.encode(sentences, convert_to_tensor=True)

# Truncate to a smaller dimension (e.g., 256)
embeddings_256d = embeddings[:, :256]
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)

# Or truncate to 64 dimensions for maximum compression
embeddings_64d = embeddings[:, :64]
embeddings_64d = F.normalize(embeddings_64d, p=2, dim=1)
```
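Recent `sentence-transformers` releases (2.7.0+) can also apply the truncation for you via the `truncate_dim` argument; a sketch, assuming that argument is available in your installed version:

```python
# Ask the model to return 256-dimensional embeddings directly.
model_256 = SentenceTransformer(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    truncate_dim=256,
)
embeddings_256d = model_256.encode(sentences, convert_to_tensor=True)
# Re-normalize after truncation before using dot-product similarity.
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)
```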
### Long Context (up to 32K tokens)
```python
# For long documents, set max_seq_length
model.max_seq_length = 8192  # or up to 32768

long_document = "..." * 10000  # Very long text
embedding = model.encode(long_document)
```
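A minimal long-document retrieval sketch on top of the snippet above (the corpus, query, and printed score are placeholders, not benchmark data):

```python
from sentence_transformers import util

# Hypothetical corpus of long documents (each may span thousands of tokens).
documents = ["First long report ...", "Second long report ...", "Third long report ..."]
query = "What were the key findings of the second report?"

model.max_seq_length = 32768  # use the full context window
doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(f"Best match: document {best} (score {scores[best]:.3f})")
```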
### Layer Reduction (Advanced)
For latency-critical applications, you can extract embeddings from intermediate layers:
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    output_hidden_states=True,
)
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use layer 6 for 3.3× speedup (56% quality)
hidden = outputs.hidden_states[6]
hidden = model.final_norm(hidden)

# Mean pooling
mask = inputs["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
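The two axes compose: the pooled layer-6 embedding above can also be truncated to a smaller Matryoshka dimension, giving configurations such as 6L/256d from the table in the 2D Matryoshka section:

```python
# Combine layer reduction with dimension reduction (6L, 256d).
embeddings_6l_256d = F.normalize(embeddings[:, :256], p=2, dim=1)
```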
## Evaluation Results
### MTEB Benchmark (24 tasks)
| Category | Score |
|---|---|
| STS (7 tasks) | 79.3 |
| Classification (6 tasks) | 62.4 |
| Pair Classification (2 tasks) | 76.2 |
| Reranking (2 tasks) | 64.4 |
| Clustering (4 tasks) | 36.9 |
| Retrieval (3 tasks) | 38.2 |
| Overall Mean | 61.4 |
### STS Benchmark
| Model | Parameters | STS Score |
|---|---|---|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| mmBERT-Embed | 307M | 80.5 |
| Qwen3-Embed-8B | 8B | 81.08 |
### 2D Matryoshka Quality Matrix (STS)
| Layers | 768d | 256d | 64d |
|---|---|---|---|
| 22L | 80.5 | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |
### Long-Context Retrieval (4K tokens)
| Metric | Score |
|---|---|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |
### Throughput (AMD MI300X)
| Layers | Throughput (sequences/s) | Speedup |
|---|---|---|
| 22L | 477 | 1.0× |
| 11L | 916 | 1.9× |
| 6L | 1573 | 3.3× |
| 3L | 2761 | 5.8× |
### Latency Comparison vs BGE-M3 and Qwen3-Embedding-0.6B
mmBERT-Embed is significantly faster at longer sequences due to:
- Flash Attention 2: BGE-M3 does not use FA2, so it materializes the full O(n²) attention matrix, while FA2 keeps attention memory linear in sequence length (see the loading sketch after this list)
- Encoder architecture: Qwen3-Embedding is a decoder-based model with causal masking
- Smaller model: 307M parameters vs 569M (BGE-M3) and 600M (Qwen3-0.6B)
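When loading the backbone directly through `transformers`, FA2 and bf16 (the precision used in the benchmarks below) can be requested explicitly; a sketch, assuming `flash-attn` is installed and the custom model code honors the setting:

```python
import torch
from transformers import AutoModel

# Load with Flash Attention 2 and bf16, matching the benchmarked setup.
model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```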
#### Batch Size = 1
| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 17.6ms (57/s) | 20.7ms (48/s) | 10.8ms (93/s) | 0.6× |
| 1024 | 18.6ms (54/s) | 21.2ms (47/s) | 16.3ms (61/s) | 0.9× |
| 2048 | 19.5ms (51/s) | 24.1ms (42/s) | 31.1ms (32/s) | 1.6× |
| 4096 | 21.3ms (47/s) | 43.5ms (23/s) | 60.5ms (17/s) | 2.8× |
#### Batch Size = 8
| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 21.1ms (379/s) | 33.0ms (243/s) | 40.0ms (200/s) | 1.9× |
| 1024 | 34.5ms (232/s) | 58.5ms (137/s) | 77.4ms (103/s) | 2.2× |
| 2048 | 65.2ms (123/s) | 117.0ms (68/s) | 162.9ms (49/s) | 2.5× |
| 4096 | 130.7ms (61/s) | 254.9ms (31/s) | 411.3ms (19/s) | 3.1× |
Key insight: the FA2 advantage grows with sequence length and batch size:
- At short sequences (512 tokens) with batch size 1, BGE-M3 is faster; FA2 brings little benefit while attention is a small share of the work
- From roughly 2K tokens onward, mmBERT-Embed pulls clearly ahead
- At 4K tokens with batch size 8, mmBERT-Embed is 3.1× faster than BGE-M3
Benchmarked on AMD MI300X, bf16 precision.
## Training
### Data
Trained on BAAI/bge-m3-data (73GB, 279 JSONL files) with:
- Multilingual triplets (query, positive, negative)
- Diverse domains and languages
### Configuration
- Base Model: llm-semantic-router/mmbert-32k-yarn
- Loss: Matryoshka2dLoss (combines AdaptiveLayerLoss + MatryoshkaLoss)
- Matryoshka Dimensions: [768, 512, 256, 128, 64]
- Epochs: 1
- Batch Size: 16 (effective 32 with gradient accumulation)
- Learning Rate: 2e-5
- Max Sequence Length: 32,768
- Hardware: AMD Instinct MI300X
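A hedged sketch of the loss wiring described above, using the `Matryoshka2dLoss` wrapper from `sentence-transformers` (data loading, the trainer loop, and exact hyperparameter plumbing are omitted and may differ from the actual training script):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import Matryoshka2dLoss, MultipleNegativesRankingLoss

# Start from the long-context base model.
model = SentenceTransformer("llm-semantic-router/mmbert-32k-yarn")

# Contrastive loss over (query, positive, negative) triplets, wrapped so that
# every Matryoshka dimension and sampled intermediate layers are supervised.
base_loss = MultipleNegativesRankingLoss(model)
loss = Matryoshka2dLoss(
    model,
    base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)
```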
## Use Cases
### When to Use mmBERT-Embed
- Multilingual RAG for 1800+ languages (especially low-resource languages not covered by Qwen3 or BGE-M3)
- Long-document retrieval where chunking loses cross-section relationships
- Edge deployment where 307M parameters matter versus 600M+
- Flexible inference where you need to trade quality for speed/storage at runtime
### When to Use Alternatives
- Maximum quality on major languages: Qwen3-Embed-8B
- Production stability: BGE-M3 (more battle-tested)
- Very short texts only: Smaller models may suffice
## Limitations
- Layer reduction quality (56% at 6L) is lower than full model; use for latency-critical applications where moderate quality is acceptable
- MTEB mean (61.4) is slightly below BGE-M3 (64.5) but with 4× longer context and 2D flexibility
- Optimized for retrieval tasks; may need fine-tuning for other downstream tasks
## Citation
```bibtex
@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}
```
## License
Apache 2.0