	---
	license: mit
	datasets:
	- Salesforce/wikitext
	- kd13/stack-v2-mini
	language:
	- en
	metrics:
	- perplexity
	base_model:
	- answerdotai/ModernBERT-base
	new_version: kd13/ModernBERT-base-mlm-wiki-code
	pipeline_tag: fill-mask
	library_name: transformers
	tags:
	- code
	- mlm
	- wiki
	---

	# ModernBERT-base-mlm-wiki-code

	## Model Summary

	ModernBERT-base-mlm-wiki-code is a continued-pretraining checkpoint of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), further trained with Masked Language Modeling (MLM) on a mixed corpus of English natural-language text and multi-language source code.

	The model was trained with a 30% masking probability (vs. the standard 15% in BERT) over a 2048-token context window. On the combined evaluation set it reaches a perplexity of 1.9598, with lower perplexity on code than on natural language (see Evaluation Results).
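
To make the masking rate concrete, here is a minimal pure-Python sketch of BERT-style token masking at a 30% rate. It assumes the common 80/10/10 split (replace with `[MASK]`, replace with a random token, keep unchanged), which is the default in the `transformers` class `DataCollatorForLanguageModeling`. This card does not state which scheme was actually used, and `mask_tokens`, the toy token list, and the seed are hypothetical names chosen for illustration.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mlm_probability=0.30, seed=0):
    """Return (masked_tokens, labels); a label is None where no prediction is required."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            # This position is selected: the model must recover the original token.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with the mask token
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the token unchanged
        else:
            # Not selected: input is untouched and contributes nothing to the loss.
            labels.append(None)
            masked.append(tok)
    return masked, labels


# Toy input (assumption): a whitespace-tokenized snippet repeated to get a stable rate.
tokens = "def add ( a , b ) : return a + b".split() * 1000
vocab = sorted(set(tokens))

masked, labels = mask_tokens(tokens, vocab)
selected = sum(label is not None for label in labels)
print(f"selected for prediction: {selected / len(tokens):.3f}")
```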

	---

	## Evaluation Results

	Evaluation was performed separately on each domain to understand per-domain performance.

	| Dataset                     | Eval Loss | Perplexity |
	|-----------------------------|-----------|------------|
	| WikiText (Natural Language) | 1.3994    | 4.0526     |
	| The Stack V2 (Code)         | 0.4091    | 1.5054     |
	| Combined (NL + Code)        | 0.6728    | 1.9598     |

	> MLM Probability: 30%, Context Length: 2048 tokens

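
The perplexity column follows from the loss column if the reported loss is the mean cross-entropy in nats over masked tokens. That is an assumption here, although it is how the `transformers` `Trainer` usually reports `eval_loss`. Under that assumption perplexity is simply `exp(loss)`, and this short check reproduces the table values to rounding:

```python
import math

# Eval losses copied from the table above.
eval_loss = {
    "WikiText (Natural Language)": 1.3994,
    "The Stack V2 (Code)": 0.4091,
    "Combined (NL + Code)": 0.6728,
}

# Perplexity is the exponential of the mean cross-entropy loss (in nats).
perplexity = {name: math.exp(loss) for name, loss in eval_loss.items()}

for name, ppl in perplexity.items():
    print(f"{name:30s} loss={eval_loss[name]:.4f}  perplexity={ppl:.4f}")
```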
	---

	## Model Details

	| Property        | Value                                      |
	|-----------------|--------------------------------------------|
	| Base Model      | `answerdotai/ModernBERT-base`              |
	| Model Type      | Masked Language Model (MLM)                |
	| Architecture    | ModernBERT (Encoder-only Transformer)      |
	| Context Length  | 2048 tokens                                |
	| MLM Probability | 30%                                        |
	| Languages       | English, C++, Go, Java, JavaScript, Python |
	| License         | MIT                                        |

	---

	## Usage

	### Installation

	```bash
	pip install transformers torch
	```

	### Load the Model

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM

	tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
	model = AutoModelForMaskedLM.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
	# Turn off ModernBERT's optional torch.compile path
	model.config.reference_compile = False
	```

	### Fill-Mask — Natural Language

	```python
	from transformers import pipeline

	pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")

	result = pipe("The capital of France is [MASK].")
	for r in result:
	    print(f"{r['token_str']:15s} → {r['score']:.4f}")
	```

	### Fill-Mask — Source Code

	```python
	from transformers import pipeline

	pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")

	result = pipe("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) [MASK] fibonacci(n-2)")
	for r in result:
	    print(f"{r['token_str']:15s} → {r['score']:.4f}")
	```

	### Feature Extraction (Embeddings)

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch

	tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
	model = AutoModel.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")

	text = "def quicksort(arr): return arr if len(arr) <= 1 else ..."
	inputs = tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)

	with torch.no_grad():
	    outputs = model(**inputs)

	# CLS token embedding — shape: (1, 768)
	embedding = outputs.last_hidden_state[:, 0, :]
	print(embedding.shape)
	```
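
One common use of such embeddings is comparing two inputs by cosine similarity. The sketch below is a plain-Python illustration only: the four-dimensional vectors are made-up stand-ins for the 768-dimensional embeddings produced above, and `cosine_similarity` is a hypothetical helper, not part of `transformers`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up placeholder vectors (assumption); real embeddings would come from
# outputs.last_hidden_state[:, 0, :] converted to Python lists.
emb_a = [0.2, -0.1, 0.7, 0.4]
emb_b = [0.1, -0.2, 0.6, 0.5]
emb_c = [-0.7, 0.3, -0.1, 0.0]

print(f"a vs b: {cosine_similarity(emb_a, emb_b):.3f}")
print(f"a vs c: {cosine_similarity(emb_a, emb_c):.3f}")
```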

	---

	## Citation

	If you use this model, please cite the original ModernBERT paper:

	```bibtex
	@article{modernbert2024,
	  title  = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
	  author = {Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
	  year   = {2024},
	  url    = {https://arxiv.org/abs/2412.13663}
	}
	```

	---

	## Acknowledgements

	- Base model by [Answer.AI](https://huggingface.co/answerdotai)
	- Training data from [WikiText](https://huggingface.co/datasets/Salesforce/wikitext) and [BigCode The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2)