# mmBERT-base

A Transformers v5-compatible checkpoint of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base).

|                  |                                |
|------------------|--------------------------------|
| Parameters       | 307M                           |
| Hidden size      | 768                            |
| Layers           | 22                             |
| Attention heads  | 12                             |
| Max seq length   | 8,192                          |
| RoPE theta       | 160,000 (both global & local)  |
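
These values can be cross-checked against the shipped configuration. A minimal sanity-check sketch, assuming the standard `AutoConfig` attribute names used by ModernBERT (this snippet is illustrative, not part of the original card):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("datalama/mmBERT-base")
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 22
print(config.num_attention_heads)      # 12
print(config.max_position_embeddings)  # 8192
```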

## Usage (transformers v5)

```python
from transformers import ModernBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertModel.from_pretrained("datalama/mmBERT-base")

# Korean example input: "AI technology is developing rapidly."
inputs = tokenizer("인공지능 기술은 빠르게 발전하고 있습니다.", return_tensors="pt")
outputs = model(**inputs)

# [CLS] embedding (768-dim)
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
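
If a single sentence vector is wanted instead of the `[CLS]` token, mean pooling over non-padding tokens is a common alternative. The sketch below is illustrative, not part of the original card, and reuses `inputs`/`outputs` from the snippet above:

```python
# Mean-pool token embeddings, masking out padding positions.
mask = inputs["attention_mask"].unsqueeze(-1).float()      # [batch, seq_len, 1]
summed = (outputs.last_hidden_state * mask).sum(dim=1)     # [batch, 768]
mean_embedding = summed / mask.sum(dim=1).clamp(min=1e-9)  # [batch, 768]
```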

For masked language modeling:

```python
from transformers import ModernBertForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertForMaskedLM.from_pretrained("datalama/mmBERT-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
```
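
As a follow-up (not part of the original card), one way to read out the top prediction at the masked position, assuming the tokenizer exposes the mask token via `mask_token_id`:

```python
# Find the [MASK] position(s) and decode the highest-scoring token for each.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = outputs.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```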

## Migration Details

This checkpoint was migrated from jhu-clsp/mmBERT-base with the following changes:

1. Weight format: `pytorch_model.bin` → `model.safetensors`

   - Tied weights (`model.embeddings.tok_embeddings.weight` ↔ `decoder.weight`) were cloned into separate tensors before saving; a minimal conversion sketch appears at the end of this section
   - All 138 tensors were verified bitwise equal after conversion

2. Config: Added explicit `rope_parameters` for transformers v5

```json
{
  "global_rope_theta": 160000,
  "local_rope_theta": 160000,
  "rope_parameters": {
    "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0}
  }
}
```

The original flat fields (`global_rope_theta`, `local_rope_theta`) are preserved for backward compatibility. In transformers v5, `ModernBertConfig` defaults `sliding_attention.rope_theta` to 10,000, but mmBERT uses 160,000 for both attention types, so the explicit `rope_parameters` entry is required.
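
A minimal sketch of how such a conversion can be reproduced (illustrative only; file names are assumptions, and this is not necessarily the exact script used):

```python
import torch
from safetensors.torch import save_file

# Load the original PyTorch weights.
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)

# safetensors rejects tensors that share storage, so clone everything into
# independent, contiguous copies before saving (this separates the tied weights).
state_dict = {name: tensor.clone().contiguous() for name, tensor in state_dict.items()}

save_file(state_dict, "model.safetensors")
```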

## Verification

Cross-environment verification was performed between transformers v4 (original) and v5 (this checkpoint):

| Check | Result |
|---|---|
| RoPE config | `rope_parameters` present, theta = 160,000 for both attention types |
| Weight integrity | 138 tensors bitwise equal (jhu-clsp `.bin` vs datalama `.safetensors`) |
| Inference output | v4 vs v5 max difference across 4 multilingual sentences: 7.63e-06 |
| Fine-tuning readiness | Tokenizer roundtrip, forward + backward pass, gradient propagation: all OK |
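
The weight-integrity check can be re-run locally along these lines; the sketch below assumes both weight files have already been downloaded and is not the exact verification script:

```python
import torch
from safetensors.torch import load_file

original = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)
converted = load_file("model.safetensors")

# Every converted tensor must match its original counterpart bitwise.
for name, tensor in converted.items():
    assert torch.equal(tensor, original[name]), f"Mismatch in {name}"
print(f"{len(converted)} tensors verified bitwise equal")
```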

## Credit

Original model by JHU CLSP. See the original model card for training details and benchmarks.
