| --- |
| license: mit |
| datasets: |
| - Salesforce/wikitext |
| - kd13/stack-v2-mini |
| language: |
| - en |
| metrics: |
| - perplexity |
| base_model: |
| - answerdotai/ModernBERT-base |
| new_version: kd13/ModernBERT-base-mlm-wiki-code |
| pipeline_tag: fill-mask |
| library_name: transformers |
| tags: |
| - code |
| - mlm |
| - wiki |
| --- |
| |
| # ModernBERT-base-mlm-wiki-code |
|
|
| ## Model Summary |
|
|
| **ModernBERT-base-mlm-wiki-code** is a continued-pretraining checkpoint of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), further trained with **Masked Language Modeling (MLM)** on a mixed corpus of English natural-language text (WikiText) and source code in several programming languages (The Stack V2). |
|
|
| The model was trained with a **30% masking probability** (versus the 15% used by the original BERT) over a **2048-token context window**. On the combined evaluation set it reaches a perplexity of **1.9598**, with per-domain figures listed under Evaluation Results. |
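
To make the masking rate concrete, the sketch below selects roughly 30% of token positions, replaces them with a mask id, and keeps the original ids as labels. It is an illustration only: the helper name `mask_tokens`, the placeholder `mask_id`, and the choice to always substitute the mask token (rather than a mixed replacement scheme such as BERT's 80/10/10) are assumptions, not a description of the actual training pipeline.

```python
import random

IGNORE_INDEX = -100  # label value ignored by default in PyTorch cross-entropy


def mask_tokens(token_ids, mask_id, mlm_probability=0.30, seed=None):
    """Return (masked_ids, labels) for a simplified MLM objective.

    Each position is selected independently with probability
    `mlm_probability`. Selected positions are replaced by `mask_id`;
    labels hold the original id there and IGNORE_INDEX everywhere else.
    """
    rng = random.Random(seed)
    masked_ids, labels = [], []
    for tok in token_ids:
        if rng.random() < mlm_probability:
            masked_ids.append(mask_id)
            labels.append(tok)
        else:
            masked_ids.append(tok)
            labels.append(IGNORE_INDEX)
    return masked_ids, labels


# Hypothetical token ids and mask id; a real tokenizer would supply these.
example_ids = list(range(1000, 1020))
masked, labels = mask_tokens(example_ids, mask_id=4, seed=0)
print(masked)
print(labels)
```

With a higher masking rate the model sees less unmasked context per prediction, which is why 30% is a harder setting than 15%.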
|
|
| --- |
|
|
| ## Evaluation Results |
|
|
| Evaluation was run separately on each domain and on the combined set. Perplexity is the exponential of the evaluation loss. |
|
|
| | Dataset | Eval Loss | Perplexity | |
| |--------------------------------|-----------|------------| |
| | WikiText (Natural Language) | 1.3994 | 4.0526 | |
| | The Stack V2 (Code) | 0.4091 | 1.5054 | |
| | **Combined (NL + Code)** | **0.6728**| **1.9598** | |
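
The perplexity column can be reproduced from the eval loss column, which is consistent with the loss being a mean cross-entropy in nats. The short check below recomputes it from the table's values; the variable names are illustrative.

```python
import math

# Eval losses copied from the table above.
eval_loss = {
    "WikiText (Natural Language)": 1.3994,
    "The Stack V2 (Code)": 0.4091,
    "Combined (NL + Code)": 0.6728,
}

# Perplexity is the exponential of the mean cross-entropy loss.
perplexity = {name: math.exp(loss) for name, loss in eval_loss.items()}

for name, ppl in perplexity.items():
    print(f"{name:30s} {ppl:.4f}")
```

The recomputed values agree with the table to within about 1e-3; the small differences come from the losses being rounded to four decimals.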
|
|
| > MLM Probability: 30%, Context Length: 2048 tokens |
|
|
| --- |
|
|
| ## Model Details |
|
|
| | Property | Value | |
| |-----------------------|--------------------------------------------------| |
| | **Base Model** | `answerdotai/ModernBERT-base` | |
| | **Model Type** | Masked Language Model (MLM) | |
| | **Architecture** | ModernBERT (Encoder-only Transformer) | |
| | **Context Length** | 2048 tokens | |
| | **MLM Probability** | 30% | |
| | **Languages** | English, C++, Go, Java, JavaScript, Python | |
| | **License**           | MIT                                              | |
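
Inputs longer than the 2048-token context length listed above need to be truncated or split before being passed to the model. One common approach, sketched here under assumptions, is a sliding window over the token ids. `chunk_token_ids` and its `stride` parameter are hypothetical names for illustration and are not part of this model or of `transformers`.

```python
def chunk_token_ids(token_ids, max_len=2048, stride=256):
    """Split `token_ids` into windows of at most `max_len` tokens.

    Consecutive windows overlap by `stride` tokens, so content near a
    window boundary also appears away from the edge in a neighbouring
    window.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")

    step = max_len - stride
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return chunks


# Hypothetical input of 5000 token ids.
windows = chunk_token_ids(list(range(5000)))
print([len(w) for w in windows])
```

This sketch ignores special tokens; in practice `max_len` would need to leave room for any tokens the tokenizer adds to each window.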
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Installation |
|
|
| ```bash |
| # ModernBERT support was added in transformers 4.48.0 |
| pip install "transformers>=4.48.0" torch |
| ``` |
|
|
| ### Load the Model |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModelForMaskedLM |
| |
| tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code") |
| model = AutoModelForMaskedLM.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code") |
| # Optional: skip torch.compile for the layers that were compiled during pretraining |
| model.config.reference_compile = False |
| ``` |
|
|
| ### Fill-Mask — Natural Language |
|
|
| ```python |
| from transformers import pipeline |
| |
| pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code") |
| |
| result = pipe("The capital of France is [MASK].") |
| for r in result: |
| print(f"{r['token_str']:15s} → {r['score']:.4f}") |
| ``` |
|
|
| ### Fill-Mask — Source Code |
|
|
| ```python |
| from transformers import pipeline |
| |
| pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code") |
| |
| result = pipe("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) [MASK] fibonacci(n-2)") |
| for r in result: |
| print(f"{r['token_str']:15s} → {r['score']:.4f}") |
| ``` |
|
|
| ### Feature Extraction (Embeddings) |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModel |
| import torch |
| |
| tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code") |
| # AutoModel loads the encoder only; the MLM head weights are not used |
| model = AutoModel.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code") |
| |
| text = "def quicksort(arr): return arr if len(arr) <= 1 else ..." |
| inputs = tokenizer(text, return_tensors="pt", max_length=2048, truncation=True) |
| |
| with torch.no_grad(): |
| outputs = model(**inputs) |
| |
| # CLS token embedding — shape: (1, 768) |
| embedding = outputs.last_hidden_state[:, 0, :] |
| print(embedding.shape) |
| ``` |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this model, please cite the original ModernBERT paper: |
|
|
| ```bibtex |
| @article{modernbert2024, |
| title = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference}, |
| author = {Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli}, |
| year = {2024}, |
| url = {https://arxiv.org/abs/2412.13663} |
| } |
| ``` |
|
|
| --- |
|
|
| ## Acknowledgements |
|
|
| - Base model by [Answer.AI](https://huggingface.co/answerdotai) |
| - Training data from [WikiText](https://huggingface.co/datasets/Salesforce/wikitext) and [BigCode The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2) |