---
license: mit
datasets:
  - Salesforce/wikitext
  - kd13/stack-v2-mini
language:
  - en
metrics:
  - perplexity
base_model:
  - answerdotai/ModernBERT-base
new_version: kd13/ModernBERT-base-mlm-wiki-code
pipeline_tag: fill-mask
library_name: transformers
tags:
  - code
  - mlm
  - wiki
---

# ModernBERT-base-mlm-wiki-code

## Model Summary

**ModernBERT-base-mlm-wiki-code** continues the pre-training of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on a mixed corpus of English natural language text and source code in several languages, using **Masked Language Modeling (MLM)**.

Training used a challenging **30% masking probability** (vs. the standard 15% in BERT) over a **2048-token context window**, and the model reached a combined perplexity of **1.9598** on the mixed natural language + code evaluation set (per-domain results below).

---

## Evaluation Results

Evaluation was performed separately on each domain, as well as on the combined mix.

| Dataset                     | Eval Loss  | Perplexity |
|-----------------------------|------------|------------|
| WikiText (Natural Language) | 1.3994     | 4.0526     |
| The Stack V2 (Code)         | 0.4091     | 1.5054     |
| **Combined (NL + Code)**    | **0.6728** | **1.9598** |

> MLM probability: 30%; context length: 2048 tokens.

---

## Model Details

| Property            | Value                                       |
|---------------------|---------------------------------------------|
| **Base Model**      | `answerdotai/ModernBERT-base`               |
| **Model Type**      | Masked Language Model (MLM)                 |
| **Architecture**    | ModernBERT (Encoder-only Transformer)       |
| **Context Length**  | 2048 tokens                                 |
| **MLM Probability** | 30%                                         |
| **Languages**       | English, C++, Go, Java, JavaScript, Python  |
| **License**         | MIT                                         |
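The configuration above maps directly onto the standard `transformers` MLM training loop. The snippet below is a minimal sketch of that setup, not the original training script: the WikiText config name, the hyperparameter values, and the omission of the code corpus and its mixing ratio are all assumptions made for illustration.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Which WikiText config was used is an assumption; the code corpus
# (kd13/stack-v2-mini) would be interleaved with this in the real setup.
dataset = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train")

def tokenize(batch):
    # Truncate to the 2048-token context window used for this model.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# 30% of input tokens are selected for masking (BERT's default is 15%).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.3,
)

# Hyperparameter values below are placeholders, not the original training config.
args = TrainingArguments(
    output_dir="modernbert-mlm-continued",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

`DataCollatorForLanguageModeling` applies the usual 80/10/10 mask/random/keep split to the selected tokens automatically, so only `mlm_probability` needs to change relative to a standard BERT-style run.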
---

## Usage

### Installation

```bash
pip install transformers torch
```

### Load the Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
model = AutoModelForMaskedLM.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")

# Disable ModernBERT's compiled forward pass for broader compatibility.
model.config.reference_compile = False
```

### Fill-Mask: Natural Language

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")

result = pipe("The capital of France is [MASK].")
for r in result:
    print(f"{r['token_str']:15s} → {r['score']:.4f}")
```

### Fill-Mask: Source Code

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")

result = pipe("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) [MASK] fibonacci(n-2)")
for r in result:
    print(f"{r['token_str']:15s} → {r['score']:.4f}")
```

### Feature Extraction (Embeddings)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
model = AutoModel.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")

text = "def quicksort(arr): return arr if len(arr) <= 1 else ..."
inputs = tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# [CLS] token embedding, shape (1, 768)
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)
```

---

## Citation

If you use this model, please cite the original ModernBERT paper:

```bibtex
@article{modernbert2024,
  title  = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author = {Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
  year   = {2024},
  url    = {https://arxiv.org/abs/2412.13663}
}
```

---

## Acknowledgements

- Base model by [Answer.AI](https://huggingface.co/answerdotai)
- Training data from [WikiText](https://huggingface.co/datasets/Salesforce/wikitext) and [BigCode The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2)