Instructions to use answerdotai/ModernBERT-large with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use answerdotai/ModernBERT-large with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="answerdotai/ModernBERT-large")

# Load the model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-large")
```
- Notebooks
- Google Colab
- Kaggle
Why add_prefix_space=false?
#5
by hankcs - opened
Hi, thank you for sharing your work, it's great!
I have a question regarding the BPE tokenizer. I saw that its add_prefix_space is set to false, which means the same word will be tokenized differently depending on its position in the text. E.g., consider the word "Hello" in the text:
Hello everyone. Hello world.
will be tokenized to:
['[CLS]', 'Hello', 'Ġeveryone', '.', 'ĠHello', 'Ġworld', '.', '[SEP]']
This leads to redundant vocabulary entries and conflicting semantics: a token without a prefix space usually marks a subword continuation, but here the sentence-initial "Hello" has no prefix space.
How did you tokenize your pretraining corpus? Is it a mistake to set add_prefix_space to false during the conversion of tokenizers?
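The effect being asked about can be illustrated without downloading the actual tokenizer. The toy pre-tokenizer below is a simplified sketch of GPT-2-style byte-level pre-tokenization (it ignores punctuation splitting and BPE merges); it only shows how a leading space becomes a "Ġ" marker attached to the following word, and how add_prefix_space changes the first word's form:

```python
def pretokenize(text, add_prefix_space=False):
    """Toy byte-level pre-tokenizer: a space attaches to the following
    word and is rendered as 'Ġ', as in GPT-2-style BPE. This is an
    illustration only, not the real ModernBERT tokenizer."""
    if add_prefix_space:
        text = " " + text
    pieces = []
    word = ""
    for ch in text:
        if ch == " ":
            if word:
                pieces.append(word)
            word = "Ġ"  # the space is carried as a prefix of the next word
        else:
            word += ch
    if word:
        pieces.append(word)
    return pieces

print(pretokenize("Hello everyone. Hello world."))
# → ['Hello', 'Ġeveryone.', 'ĠHello', 'Ġworld.']

print(pretokenize("Hello everyone. Hello world.", add_prefix_space=True))
# → ['ĠHello', 'Ġeveryone.', 'ĠHello', 'Ġworld.']
```

With add_prefix_space=False the sentence-initial "Hello" and the mid-sentence "ĠHello" are two distinct vocabulary entries, which is exactly the redundancy described above; with add_prefix_space=True both occurrences map to "ĠHello".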