---
language: en
tags:
- tokenizers
- wordpiece
- bytepairencoding
- xlnet
- nlp
license: mit
---

# Basic Tokenizers Collection

This repository contains **three different tokenizers** trained and wrapped for experimentation and educational purposes:

## 📦 Contents
- **WordPiece Tokenizer**  
  Path: `ByteMeHarder-404/tokenizers/wordpiece`  
  Classic subword tokenizer, as used in BERT. Splits words into subword units chosen by frequency, giving full coverage of the input with a compact vocab. (A toy illustration of the segmentation step follows this list.)

- **Byte-Pair Encoding (BPE) Tokenizer**  
  Path: `ByteMeHarder-404/tokenizers/bpe`  
  Uses byte-level BPE, similar to GPT-2 and RoBERTa. By working directly on bytes, it handles any UTF-8 input without ever producing unknown tokens.

- **XLNet-Style Tokenizer**  
  Path: `ByteMeHarder-404/tokenizers/xlnet`  
  Follows the XLNet tokenization approach, using SentencePiece-style unigram segmentation.
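
To make the WordPiece segmentation concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece performs at encode time. The vocabulary below is a hypothetical toy, not the vocab shipped in this repo:

```python
# Toy vocabulary for illustration only -- not this repo's trained vocab.
# Non-initial subwords carry the "##" continuation prefix, as in BERT.
TOY_VOCAB = {"un", "aff", "##aff", "##able", "##ord", "[UNK]"}

def wordpiece_segment(word, vocab):
    """Greedy longest-match-first segmentation, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation subwords get "##"
            if piece in vocab:
                match = piece         # longest matching piece wins
                break
            end -= 1                  # no match: shrink the span and retry
        if match is None:
            return ["[UNK]"]          # any unmatched span makes the whole word unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_segment("unaffable", TOY_VOCAB))  # ['un', '##aff', '##able']
```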
## 🚀 Usage
You can load each tokenizer with `transformers`:

```python
from transformers import PreTrainedTokenizerFast

# WordPiece
tok_wordpiece = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers/wordpiece")

# BPE
tok_bpe = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers/bpe")

# XLNet-style
tok_xlnet = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers/xlnet")
```
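
Once loaded, all three expose the standard fast-tokenizer API. A quick round-trip sketch (the token ids shown in comments are illustrative; the real ids depend on each trained vocabulary):

```python
text = "Tokenizers split text into subword units."

enc = tok_bpe(text)                        # dict with input_ids, attention_mask, ...
print(enc["input_ids"])                    # ids depend on the trained vocab
print(tok_bpe.convert_ids_to_tokens(enc["input_ids"]))
print(tok_bpe.decode(enc["input_ids"]))    # should round-trip back to the text
```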
## 📝 Notes

- These tokenizers are minimal examples; **no model weights or embeddings are included**.
- They are intended for experimentation, educational purposes, and as a foundation for building custom models.
- You can extend them by training a new vocabulary on your own dataset; a sketch follows below.
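
For that last point, here is a minimal sketch of training a fresh byte-level BPE vocabulary with the `tokenizers` library and wrapping it for `transformers`. The corpus file, vocab size, and special tokens are placeholder choices, not settings taken from this repo:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Build an untrained byte-level BPE tokenizer.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

# "corpus.txt", the vocab size, and the special tokens are placeholders --
# substitute your own dataset and settings.
trainer = BpeTrainer(vocab_size=8000, special_tokens=["<pad>", "<unk>"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Wrap and save in the same format as the tokenizers above.
fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast.save_pretrained("my-tokenizer")
```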