kd13 committed on
Commit 8382ff6 · verified
1 Parent(s): 6937536

Update README.md

  1. README.md +140 -3
README.md CHANGED
@@ -1,3 +1,140 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ datasets:
4
+ - Salesforce/wikitext
5
+ - kd13/stack-v2-mini
6
+ language:
7
+ - en
8
+ metrics:
9
+ - perplexity
10
+ base_model:
11
+ - answerdotai/ModernBERT-base
12
+ new_version: kd13/ModernBERT-base-mlm-wiki-code
13
+ pipeline_tag: fill-mask
14
+ library_name: transformers
15
+ tags:
16
+ - code
17
+ - mlm
18
+ - wiki
19
+ ---
20
+
21
+ # ModernBERT-base-mlm-wiki-code
22
+
23
+ ## Model Summary
24
+
25
+ **ModernBERT-base-mlm-wiki-code** is [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) after continued pre-training on a large mixed corpus of English natural-language text and source code in multiple programming languages, using **Masked Language Modeling (MLM)**.
26
+
27
+ The model was trained with a **30% masking probability** (vs. the standard 15% in BERT) over a **2048-token context window**. On the combined evaluation set it reaches a perplexity of **1.9598**; perplexity is lower on code (1.5054) than on natural language (4.0526).
28
+
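The 30% masking described above can be illustrated with a minimal, standard-library sketch. It works on string tokens rather than token IDs, and the names `mask_tokens`, `MASK_TOKEN`, and `IGNORE_INDEX` are hypothetical helpers for this example. This README does not state which data collator was used for training; in the Transformers library the usual route is `DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.3)`, which by default also swaps some selected tokens for random or unchanged tokens instead of always inserting the mask token.

```python
import random

MASK_TOKEN = "[MASK]"   # hypothetical constant for this sketch
IGNORE_INDEX = -100     # label value commonly used to skip unmasked positions in the loss


def mask_tokens(tokens, mlm_probability=0.30, seed=None):
    """Return (masked_tokens, labels), masking each token independently.

    labels[i] holds the original token where a mask was applied and
    IGNORE_INDEX elsewhere, so only masked positions contribute to the loss.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(IGNORE_INDEX)
    return masked, labels


tokens = "def add ( a , b ) : return a + b".split()
masked, labels = mask_tokens(tokens, mlm_probability=0.30, seed=0)
print(masked)
print(labels)
```

At 30%, roughly twice as many tokens per sequence are hidden as with BERT's 15%, so each prediction has less surrounding context to draw on.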
29
+ ---
30
+
31
+ ## Evaluation Results
32
+
33
+ The model was evaluated separately on each domain and on the combined set.
34
+
35
+ | Dataset | Eval Loss | Perplexity |
36
+ |--------------------------------|-----------|------------|
37
+ | WikiText (Natural Language) | 1.3994 | 4.0526 |
38
+ | The Stack V2 (Code) | 0.4091 | 1.5054 |
39
+ | **Combined (NL + Code)** | **0.6728** | **1.9598** |
40
+
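The perplexity column in the table above is the exponential of the eval loss. The short check below assumes the losses are mean cross-entropy values in nats (the PyTorch default); under that assumption it reproduces the table to about three decimal places, with the small remainder coming from the losses being rounded to four digits.

```python
import math

# Eval losses copied from the table above (assumed: mean cross-entropy in nats).
eval_loss = {
    "WikiText (Natural Language)": 1.3994,
    "The Stack V2 (Code)": 0.4091,
    "Combined (NL + Code)": 0.6728,
}


def perplexity(loss):
    """Perplexity is exp(mean cross-entropy loss)."""
    return math.exp(loss)


for name, loss in eval_loss.items():
    print(f"{name:30s} loss={loss:.4f}  ppl={perplexity(loss):.4f}")
```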
41
+ > MLM Probability: 30% — Context Length: 2048 tokens
42
+ ---
43
+
44
+ ## Model Details
45
+
46
+ | Property | Value |
47
+ |-----------------------|--------------------------------------------------|
48
+ | **Base Model** | `answerdotai/ModernBERT-base` |
49
+ | **Model Type** | Masked Language Model (MLM) |
50
+ | **Architecture** | ModernBERT (Encoder-only Transformer) |
51
+ | **Context Length** | 2048 tokens |
52
+ | **MLM Probability** | 30% |
53
+ | **Languages** | English, C++, Go, Java, JavaScript, Python |
54
+ | **License** | MIT |
55
+
56
+ ---
57
+
58
+ ## Usage
59
+
60
+ ### Installation
61
+
62
+ ```bash
63
+ pip install "transformers>=4.48.0" torch
64
+ ```
65
+
66
+ ### Load the Model
67
+
68
+ ```python
69
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
70
+
71
+ tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
72
+ model = AutoModelForMaskedLM.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
73
+ model.config.reference_compile = False  # optional: skip torch.compile of selected layers
74
+ ```
75
+
76
+ ### Fill-Mask — Natural Language
77
+
78
+ ```python
79
+ from transformers import pipeline
80
+
81
+ pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")
82
+
83
+ result = pipe("The capital of France is [MASK].")
84
+ for r in result:
85
+     print(f"{r['token_str']:15s} → {r['score']:.4f}")
86
+ ```
87
+
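The `score` printed above is the softmax probability the model assigns to each candidate token at the masked position; the fill-mask pipeline returns the top candidates (five by default). The sketch below reproduces that ranking step with the standard library only. The vocabulary, the logits, and the helper name `top_k_scores` are made up for illustration and are not outputs of this model.

```python
import math


def top_k_scores(logits, vocab, k=5):
    """Softmax over the logits at the mask position, then keep the k best tokens."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical toy vocabulary and logits, for illustration only.
vocab = ["Paris", "London", "Lyon", "Berlin"]
logits = [6.0, 2.0, 3.0, 1.0]

for token, score in top_k_scores(logits, vocab, k=3):
    print(f"{token:15s} -> {score:.4f}")
```

Because the scores are normalized over the whole vocabulary, the top-k values generally sum to less than 1.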
88
+ ### Fill-Mask — Source Code
89
+
90
+ ```python
91
+ from transformers import pipeline
92
+
93
+ pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")
94
+
95
+ result = pipe("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) [MASK] fibonacci(n-2)")
96
+ for r in result:
97
+     print(f"{r['token_str']:15s} → {r['score']:.4f}")
98
+ ```
99
+
100
+ ### Feature Extraction (Embeddings)
101
+
102
+ ```python
103
+ from transformers import AutoTokenizer, AutoModel
104
+ import torch
105
+
106
+ tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
107
+ model = AutoModel.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
108
+
109
+ text = "def quicksort(arr): return arr if len(arr) <= 1 else ..."
110
+ inputs = tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
111
+
112
+ with torch.no_grad():
113
+     outputs = model(**inputs)
114
+
115
+ # CLS token embedding — shape: (1, 768)
116
+ embedding = outputs.last_hidden_state[:, 0, :]
117
+ print(embedding.shape)
118
+ ```
119
+
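One common use of the embeddings extracted above is comparing two inputs by cosine similarity. The sketch below is a standard-library version that operates on plain Python lists, for example the result of `embedding[0].tolist()`; the helper name `cosine_similarity` and the toy vectors are assumptions for illustration. Since this model was trained with MLM only and not with a similarity objective, raw CLS similarities may need fine-tuning or a different pooling strategy to be useful.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for 768-dimensional embeddings.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```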
120
+ ---
121
+
122
+ ## Citation
123
+
124
+ If you use this model, please cite the original ModernBERT paper:
125
+
126
+ ```bibtex
127
+ @article{modernbert2024,
128
+ title = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
129
+ author = {Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
130
+ year = {2024},
131
+ url = {https://arxiv.org/abs/2412.13663}
132
+ }
133
+ ```
134
+
135
+ ---
136
+
137
+ ## Acknowledgements
138
+
139
+ - Base model by [Answer.AI](https://huggingface.co/answerdotai)
140
+ - Training data from [WikiText](https://huggingface.co/datasets/Salesforce/wikitext) and [BigCode The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2)