Instructions to use answerdotai/ModernBERT-large with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use answerdotai/ModernBERT-large with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="answerdotai/ModernBERT-large")

# Load the model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-large")
```
- Notebooks
- Google Colab
- Kaggle
Why add_prefix_space=false?
#5
by hankcs - opened
Hi, thank you for sharing your work, it's great!
I have a question regarding the BPE tokenizer. I saw that its add_prefix_space is set to false, which means the same word will be tokenized differently depending on its position in the text. E.g., consider the word "Hello" in the text:
Hello everyone. Hello world.
will be tokenized to:
['[CLS]', 'Hello', 'Ġeveryone', '.', 'ĠHello', 'Ġworld', '.', '[SEP]']
This leads to redundant vocabulary entries and conflicting semantics: a token without a prefix space usually marks a subword continuation, but here the sentence-initial "Hello" has no prefix space.
How did you tokenize your pretraining corpus? Is it a mistake to set add_prefix_space to false during the conversion of tokenizers?
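The effect being asked about can be illustrated without downloading the actual tokenizer. The toy pre-tokenizer below is a simplified sketch of GPT-2-style byte-level pre-tokenization (it ignores punctuation splitting and BPE merges); it only shows how a leading space becomes a "Ġ" marker attached to the following word, and how add_prefix_space changes the first word's form:

```python
def pretokenize(text, add_prefix_space=False):
    """Toy byte-level pre-tokenizer: a space attaches to the following
    word and is rendered as 'Ġ', as in GPT-2-style BPE. This is an
    illustration only, not the real ModernBERT tokenizer."""
    if add_prefix_space:
        text = " " + text
    pieces = []
    word = ""
    for ch in text:
        if ch == " ":
            if word:
                pieces.append(word)
            word = "Ġ"  # the space is carried as a prefix of the next word
        else:
            word += ch
    if word:
        pieces.append(word)
    return pieces

print(pretokenize("Hello everyone. Hello world."))
# → ['Hello', 'Ġeveryone.', 'ĠHello', 'Ġworld.']

print(pretokenize("Hello everyone. Hello world.", add_prefix_space=True))
# → ['ĠHello', 'Ġeveryone.', 'ĠHello', 'Ġworld.']
```

With add_prefix_space=False the sentence-initial "Hello" and the mid-sentence "ĠHello" are two distinct vocabulary entries, which is exactly the redundancy described above; with add_prefix_space=True both occurrences map to "ĠHello".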