# LilChatBot WordLevel Tokenizer
A WordLevel tokenizer trained for the LilChatBot project.
This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.
## Design choices
- WordLevel tokenization (no subword splitting)
- Lowercasing
- Unicode NFKC normalization
- Apostrophes preserved everywhere (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- Aggressive punctuation isolation, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`–` `—`)
- Repeated punctuation collapsed (`!!!` → `!`, `???` → `?`, `...` → `.`)
- English-focused
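As a rough mental model of these rules (this is an illustrative sketch in plain Python, not the actual pipeline defined in `tokenizer.json`), the normalization and pre-tokenization steps can be approximated like this:

```python
import re
import unicodedata

def pretokenize(text: str) -> list[str]:
    """Sketch of the card's stated rules: NFKC normalization,
    lowercasing, collapsing repeated punctuation, and isolating
    punctuation while leaving apostrophes attached to words."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Collapse repeated sentence punctuation: "!!!" -> "!", "..." -> "."
    text = re.sub(r"([.!?])\1+", r"\1", text)
    # Isolate punctuation with surrounding spaces; apostrophes are
    # deliberately absent from this class, so "don't" stays whole.
    text = re.sub(r"([.,!?;:()\[\]{}/\"“”–—])", r" \1 ", text)
    return text.split()

print(pretokenize("Don't panic!!! (See docs...)"))
# -> ["don't", 'panic', '!', '(', 'see', 'docs', '.', ')']
```

The key design point visible here is the asymmetry: every other punctuation mark becomes its own token, but apostrophes never split a word.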
This tokenizer intentionally favors lexical transparency over vocabulary compactness.
## Files
- `tokenizer.json`: complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)
The tokenizer can be used directly via the tokenizers library or wrapped for use with transformers.
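For direct use with the `tokenizers` library, a WordLevel tokenizer follows this shape. The snippet below builds a tiny toy vocabulary inline so it is self-contained; in practice you would load the shipped file with `Tokenizer.from_file("tokenizer.json")` instead:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit

# Toy vocabulary for illustration only -- the real vocab lives in
# tokenizer.json. WhitespaceSplit keeps apostrophes inside tokens.
vocab = {"[UNK]": 0, "the": 1, "lion's": 2, "roars": 3, "!": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = WhitespaceSplit()

enc = tok.encode("the lion's roars !")
print(enc.tokens)   # -> ['the', "lion's", 'roars', '!']
print(enc.ids)      # -> [1, 2, 3, 4]
```

Because the model is WordLevel, any word not in the vocabulary maps to `[UNK]` rather than being split into subwords.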
## Usage

### With transformers
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")
print(tok.decode(tok("The lion's well-being matters — don't forget that!").input_ids))
```