---
language: en
tags:
- tokenizers
- wordpiece
- bytepairencoding
- xlnet
- nlp
license: mit
---

# Basic Tokenizers Collection

This repository contains **three different tokenizers** trained and wrapped for experimentation and educational purposes:

## 📦 Contents
- **WordPiece Tokenizer**  
  Path: `ByteMeHarder-404/tokenizers/wordpiece`  
  Classic subword tokenizer, as used in BERT. Splits words into subword units chosen by frequency, giving full coverage of the input with a compact vocab. (A toy illustration of the segmentation step follows this list.)

- **Byte-Pair Encoding (BPE) Tokenizer**  
  Path: `ByteMeHarder-404/tokenizers/bpe`  
  Uses byte-level BPE, similar to GPT-2 and RoBERTa. By working directly on bytes, it handles any UTF-8 input without ever producing unknown tokens.

- **XLNet-Style Tokenizer**  
  Path: `ByteMeHarder-404/tokenizers/xlnet`  
  Follows the XLNet tokenization approach, using SentencePiece-style unigram segmentation.
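
To make the WordPiece segmentation concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece performs at encode time. The vocabulary below is a hypothetical toy, not the vocab shipped in this repo:

```python
# Toy vocabulary for illustration only -- not this repo's trained vocab.
# Non-initial subwords carry the "##" continuation prefix, as in BERT.
TOY_VOCAB = {"un", "aff", "##aff", "##able", "##ord", "[UNK]"}

def wordpiece_segment(word, vocab):
    """Greedy longest-match-first segmentation, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation subwords get "##"
            if piece in vocab:
                match = piece         # longest matching piece wins
                break
            end -= 1                  # no match: shrink the span and retry
        if match is None:
            return ["[UNK]"]          # any unmatched span makes the whole word unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece_segment("unaffable", TOY_VOCAB))  # ['un', '##aff', '##able']
```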
## 🚀 Usage
You can load each tokenizer with `transformers`:

```python
from transformers import PreTrainedTokenizerFast

# WordPiece
tok_wordpiece = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers/wordpiece")

# BPE
tok_bpe = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers/bpe")

# XLNet-style
tok_xlnet = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers/xlnet")
```
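
Once loaded, all three expose the standard fast-tokenizer API. A quick round-trip sketch (the token ids shown in comments are illustrative; the real ids depend on each trained vocabulary):

```python
text = "Tokenizers split text into subword units."

enc = tok_bpe(text)                        # dict with input_ids, attention_mask, ...
print(enc["input_ids"])                    # ids depend on the trained vocab
print(tok_bpe.convert_ids_to_tokens(enc["input_ids"]))
print(tok_bpe.decode(enc["input_ids"]))    # should round-trip back to the text
```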
## 📝 Notes

- These tokenizers are minimal examples; **no model weights or embeddings are included**.
- They are intended for experimentation, educational purposes, and as a foundation for building custom models.
- You can extend them by training a new vocabulary on your own dataset; a sketch follows below.
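
For that last point, here is a minimal sketch of training a fresh byte-level BPE vocabulary with the `tokenizers` library and wrapping it for `transformers`. The corpus file, vocab size, and special tokens are placeholder choices, not settings taken from this repo:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Build an untrained byte-level BPE tokenizer.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

# "corpus.txt", the vocab size, and the special tokens are placeholders --
# substitute your own dataset and settings.
trainer = BpeTrainer(vocab_size=8000, special_tokens=["<pad>", "<unk>"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Wrap and save in the same format as the tokenizers above.
fast = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast.save_pretrained("my-tokenizer")
```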