LilChatBot WordLevel Tokenizer

A WordLevel tokenizer trained for the LilChatBot project.

This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.


Design choices

  • WordLevel tokenization (no subword splitting)
  • Lowercasing
  • Unicode NFKC normalization
  • Apostrophes preserved everywhere
    (e.g. don't, lion's, 'hello', James')
  • Aggressive punctuation isolation, including:
    • sentence punctuation (. , ! ? ; :)
    • brackets (() [] {})
    • slashes (/)
    • double quotes (straight and curly)
  • en/em dashes (– —)
  • Repeated punctuation collapsed
    (!!! → !, ??? → ?, ... → .)
  • English-focused

This tokenizer intentionally favors lexical transparency over vocabulary compactness.
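The design choices above can be sketched with the tokenizers library. This is a minimal illustration, not the actual training recipe: the regex patterns, the toy corpus, and the exact normalizer ordering are assumptions — tokenizer.json is the source of truth.

```python
from tokenizers import Regex, Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer

# WordLevel model: plain word-to-id lookup, no subword splitting.
tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))

# Normalize: NFKC, lowercase, and collapse runs of repeated punctuation.
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFKC(),
    normalizers.Lowercase(),
    normalizers.Replace(Regex(r"\.{2,}"), "."),
    normalizers.Replace(Regex(r"!{2,}"), "!"),
    normalizers.Replace(Regex(r"\?{2,}"), "?"),
])

# Pre-tokenize: split on whitespace, then isolate punctuation as its own
# tokens. The apostrophe is deliberately absent from the pattern, so
# words like don't and lion's survive intact.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Split(Regex(r'[.,!?;:()\[\]{}/"“”–—]'), behavior="isolated"),
])

# Train on a toy corpus (illustrative only).
trainer = WordLevelTrainer(special_tokens=["[UNK]"])
tokenizer.train_from_iterator(
    ["The lion's well-being matters!", "Don't forget that."], trainer
)

print(tokenizer.encode("Don't forget!!!").tokens)  # ["don't", 'forget', '!']
```

Note how the repeated `!!!` collapses to a single `!` before pre-tokenization, and the apostrophe in `don't` never triggers a split.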


Files

  • tokenizer.json — complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)

The tokenizer can be used directly via the tokenizers library or wrapped for use with transformers.
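Loading works through the standard single-file mechanism. A self-contained sketch of the round trip (the tiny vocab and filename here are placeholders — in practice you would pass the path to this repo's tokenizer.json to Tokenizer.from_file):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Any Tokenizer serializes to one JSON file and loads back with
# Tokenizer.from_file -- the same call used for this repo's tokenizer.json.
tok = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))
tok.save("toy-tokenizer.json")

reloaded = Tokenizer.from_file("toy-tokenizer.json")
print(reloaded.encode("hello").tokens)  # ['hello']
```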


Usage

With transformers

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")

print(tok.decode(tok("The lion's well-being matters — don't forget that!").input_ids))