mmBERT-Embed-32K-2D-Matryoshka

A multilingual embedding model with 32K context window and 2D Matryoshka support for flexible efficiency-quality tradeoffs.

Model Highlights

| Feature | Value |
|---|---|
| Parameters | 307M |
| Context Length | 32,768 tokens |
| Languages | 1800+ (via Glot500) |
| Embedding Dim | 768 (supports 64-768 via Matryoshka) |
| Architecture | ModernBERT encoder with YaRN scaling |

Key Results

| Metric | Score |
|---|---|
| MTEB Mean (24 tasks) | 61.4 |
| STS Benchmark | 80.5 (exceeds Qwen3-0.6B's 76.17) |
| Dimension Retention | 99% @ 256d, 98% @ 64d |
| Layer Speedup | 3.3× @ 6L, 5.8× @ 3L |
| Latency vs BGE-M3 | 1.6-3.1× faster (FA2 advantage) |

What is 2D Matryoshka?

This model supports two dimensions of flexibility:

  1. Dimension Reduction (Matryoshka): Truncate embeddings to smaller dimensions with minimal quality loss
  2. Layer Reduction (Adaptive): Use intermediate layer outputs for faster inference

| Config | Quality | Speedup | Storage |
|---|---|---|---|
| 22L, 768d | 100% | 1.0× | 100% |
| 22L, 256d | 99% | 1.0× | 33% |
| 22L, 64d | 98% | 1.0× | 8% |
| 6L, 768d | 56% | 3.3× | 100% |
| 6L, 256d | 56% | 3.3× | 33% |
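
The Storage column scales linearly with the embedding dimension. As a back-of-the-envelope example (assuming float32 vectors and a hypothetical corpus of one million documents; fp16 would halve these numbers):

# Rough index-size arithmetic for 1M vectors, float32 assumed (4 bytes per value)
n_vectors = 1_000_000
bytes_per_value = 4
for dim in (768, 256, 64):
    size_gb = n_vectors * dim * bytes_per_value / 1e9
    print(f"{dim:>3}d: {size_gb:.2f} GB")
# 768d: 3.07 GB, 256d: 1.02 GB, 64d: 0.26 GB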

Usage

Basic Usage (Sentence Transformers)

from sentence_transformers import SentenceTransformer

# Load model
model = SentenceTransformer("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode sentences
sentences = [
    "This is a test sentence.",
    "这是一个测试句子。",
    "Dies ist ein Testsatz.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 768)
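
As a quick sanity check of the multilingual alignment, you can score the three translations above against each other with cosine similarity (util.cos_sim ships with sentence-transformers):

from sentence_transformers import util

# Pairwise cosine similarities; the three translations of the same sentence should all score high
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)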

Matryoshka Dimension Reduction

import torch.nn.functional as F

# Encode with full dimensions
embeddings = model.encode(sentences, convert_to_tensor=True)

# Truncate to smaller dimension (e.g., 256)
embeddings_256d = embeddings[:, :256]
embeddings_256d = F.normalize(embeddings_256d, p=2, dim=1)

# Or truncate to 64 dimensions for maximum compression
embeddings_64d = embeddings[:, :64]
embeddings_64d = F.normalize(embeddings_64d, p=2, dim=1)
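
On recent sentence-transformers releases (2.7+), the same truncation can also be requested at load time via the truncate_dim argument; treat this as a convenience sketch and verify against your installed version:

# Ask sentence-transformers to truncate for you; normalization is still requested explicitly
model_256d = SentenceTransformer(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    truncate_dim=256,
)
embeddings_256d = model_256d.encode(sentences, normalize_embeddings=True)
print(embeddings_256d.shape)  # (3, 256)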

Long Context (up to 32K tokens)

# For long documents, set max_seq_length
model.max_seq_length = 8192  # or up to 32768

long_document = "..." * 10000  # Very long text
embedding = model.encode(long_document)
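
Before encoding, it can help to check how many tokens a document actually occupies against the window you configured; model.tokenizer exposes the underlying Hugging Face tokenizer:

# Count tokens to see whether the document fits in the configured window
n_tokens = len(model.tokenizer(long_document)["input_ids"])
print(f"{n_tokens} tokens (max_seq_length = {model.max_seq_length})")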

Layer Reduction (Advanced)

For latency-critical applications, you can extract embeddings from intermediate layers:

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model = AutoModel.from_pretrained(
    "llm-semantic-router/mmbert-embed-32k-2d-matryoshka",
    trust_remote_code=True,
    output_hidden_states=True
)
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/mmbert-embed-32k-2d-matryoshka")

# Encode
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    
    # Use layer 6 for 3.3× speedup (56% quality)
    hidden = outputs.hidden_states[6]
    hidden = model.final_norm(hidden)
    
    # Mean pooling
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (hidden * mask).sum(1) / mask.sum(1)
    embeddings = F.normalize(pooled, p=2, dim=1)
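
To exercise both axes of the 2D scheme at once, truncate the intermediate-layer embedding before normalizing; for example, 6 layers plus 256 dimensions keeps the 3.3× speedup while cutting storage to a third:

# Combine layer reduction (6L) with Matryoshka truncation (256d)
embeddings_6l_256d = F.normalize(pooled[:, :256], p=2, dim=1)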

Evaluation Results

MTEB Benchmark (24 tasks)

| Category | Score |
|---|---|
| STS (7 tasks) | 79.3 |
| Classification (6 tasks) | 62.4 |
| Pair Classification (2 tasks) | 76.2 |
| Reranking (2 tasks) | 64.4 |
| Clustering (4 tasks) | 36.9 |
| Retrieval (3 tasks) | 38.2 |
| Overall Mean | 61.4 |

STS Benchmark

| Model | Parameters | STS Score |
|---|---|---|
| Qwen3-Embed-0.6B | 600M | 76.17 |
| mmBERT-Embed | 307M | 80.5 |
| Qwen3-Embed-8B | 8B | 81.08 |

2D Matryoshka Quality Matrix (STS)

| Layers | 768d | 256d | 64d |
|---|---|---|---|
| 22L | 80.5 | 79.9 | 78.5 |
| 11L | 53.7 | 48.0 | 44.4 |
| 6L | 45.2 | 45.2 | 43.5 |
| 3L | 44.0 | 44.1 | 41.8 |

Long-Context Retrieval (4K tokens)

| Metric | Score |
|---|---|
| R@1 | 68.8% |
| R@10 | 81.2% |
| MRR | 71.9% |

Throughput (AMD MI300X)

| Layers | Throughput | Speedup |
|---|---|---|
| 22L | 477/s | 1.0× |
| 11L | 916/s | 1.9× |
| 6L | 1573/s | 3.3× |
| 3L | 2761/s | 5.8× |

Latency Comparison vs BGE-M3 and Qwen3-Embedding-0.6B

mmBERT-Embed is significantly faster due to:

  1. Flash Attention 2 - BGE-M3 does not use FA2; FA2 reduces attention memory from O(n²) to O(n), which pays off at long sequence lengths
  2. Encoder architecture - Qwen3 uses a decoder with causal masking
  3. Smaller model - 307M params vs 569M (BGE-M3) and 600M (Qwen3-0.6B)

Batch Size = 1

| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | mmBERT Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 17.6ms (57/s) | 20.7ms (48/s) | 10.8ms (93/s) | 0.6× |
| 1024 | 18.6ms (54/s) | 21.2ms (47/s) | 16.3ms (61/s) | 0.9× |
| 2048 | 19.5ms (51/s) | 24.1ms (42/s) | 31.1ms (32/s) | 1.6× |
| 4096 | 21.3ms (47/s) | 43.5ms (23/s) | 60.5ms (17/s) | 2.8× |

Batch Size = 8

| Seq Len | mmBERT-Embed | Qwen3-0.6B | BGE-M3 | mmBERT Speedup vs BGE-M3 |
|---|---|---|---|---|
| 512 | 21.1ms (379/s) | 33.0ms (243/s) | 40.0ms (200/s) | 1.9× |
| 1024 | 34.5ms (232/s) | 58.5ms (137/s) | 77.4ms (103/s) | 2.2× |
| 2048 | 65.2ms (123/s) | 117.0ms (68/s) | 162.9ms (49/s) | 2.5× |
| 4096 | 130.7ms (61/s) | 254.9ms (31/s) | 411.3ms (19/s) | 3.1× |

Key insight: The FA2 advantage grows with sequence length and batch size:

  • At short sequences (512), BGE-M3 is still faster: attention is cheap there, so FA2 has little advantage to offer
  • At 2K+ tokens, mmBERT pulls ahead significantly
  • At 4K batch=8: mmBERT is 3.1× faster than BGE-M3

Benchmarked on AMD MI300X, bf16 precision.

Training

Data

Trained on BAAI/bge-m3-data (73GB, 279 JSONL files) with:

  • Multilingual triplets (query, positive, negative)
  • Diverse domains and languages

Configuration

  • Base Model: llm-semantic-router/mmbert-32k-yarn
  • Loss: Matryoshka2dLoss (combines AdaptiveLayerLoss + MatryoshkaLoss); a configuration sketch follows this list
  • Matryoshka Dimensions: [768, 512, 256, 128, 64]
  • Epochs: 1
  • Batch Size: 16 (effective 32 with gradient accumulation)
  • Learning Rate: 2e-5
  • Max Sequence Length: 32,768
  • Hardware: AMD Instinct MI300X
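
For reference, the loss setup corresponds roughly to the sentence-transformers configuration below. The inner contrastive loss is not stated in this card, so MultipleNegativesRankingLoss is shown as a plausible assumption for (query, positive, negative) triplets; other hyperparameters are left at library defaults.

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss, Matryoshka2dLoss

base_model = SentenceTransformer("llm-semantic-router/mmbert-32k-yarn")

# Inner triplet loss (assumed), wrapped by the 2D Matryoshka loss over layers and dimensions
inner_loss = MultipleNegativesRankingLoss(base_model)
loss = Matryoshka2dLoss(
    base_model,
    inner_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)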

Use Cases

When to Use mmBERT-Embed

  1. Multilingual RAG for 1800+ languages (especially low-resource languages not covered by Qwen3 or BGE-M3)
  2. Long-document retrieval where chunking loses cross-section relationships
  3. Edge deployment where 307M params matters vs 600M+
  4. Flexible inference where you need to trade quality for speed/storage at runtime

When to Use Alternatives

  • Maximum quality on major languages: Qwen3-Embed-8B
  • Production stability: BGE-M3 (more battle-tested)
  • Very short texts only: Smaller models may suffice

Limitations

  • Layer reduction quality (56% at 6L) is lower than full model; use for latency-critical applications where moderate quality is acceptable
  • MTEB mean (61.4) is slightly below BGE-M3 (64.5) but with 4× longer context and 2D flexibility
  • Optimized for retrieval tasks; may need fine-tuning for other downstream tasks

Citation

@misc{mmbert-embed-2d-matryoshka,
  title={mmBERT-Embed: Multilingual Embedding Model with 2D Matryoshka Training},
  author={vLLM Semantic Router Team},
  year={2025},
  url={https://huggingface.co/llm-semantic-router/mmbert-embed-32k-2d-matryoshka}
}

License

Apache 2.0
