Update README.md

26b4ea5 verified about 10 hours ago

4.07 kB

	---
	license: apache-2.0
	language:
	- multilingual
	tags:
	- tokenizer
	- bpe
	- byte-level-bpe
	- chatml
	- routing
	- moe
	---

	Enjoy — We extend the JiRack Models Ecosystem! 🚀
	- The Best RAG with Mixture-of-Experts via CMS Manhattan RAG System with JiRack Ternary Experts for CPU Solution.
	- JiRackTernary_1b model https://huggingface.co/kgrabko/JiRackTernary_1b

	# JiRack Router Tokenizer Pro

	High-performance custom Byte-Level BPE tokenizer trained on the full Wikipedia dump across multiple languages.

	Developed specifically for intelligent routing models and Mixture-of-Experts systems.


	## Model Details

	- Developer: Konstantin Grabko (JiRack)
	- License: Apache License 2.0
	- Training Data: Full Wikipedia multilingual dump
	- Vocabulary Size: 65,536 tokens
	- Special Tokens: 128 reserved tokens (including 40+ domain routing tokens)

	## Key Features

	- Correctly placed `<\|unk\|>` token at ID 0
	- Full native support for ChatML format (`<\|im_start\|>` / `<\|im_end\|>`)
	- Large set of specialized routing tokens (`__CODING__`, `__MATH__`, `__PYTHON__`, `__SCIENCE__`, etc.)
	- Support JiRack Robotics Technology with tags (`<\|action_start\|>` / `<\|action_end\|>`)
	- Support JiRack vision , images , audio-visual
	- Strong multilingual performance

	# Basic system and dialogue tokens
	"<\|unk\|>",
	"<\|endoftext\|>",
	"<\|padding\|>",
	"<\|im_start\|>",
	"<\|im_end\|>",
	# Core roles
	"<\|im_start\|>system",
	"<\|im_start\|>user",
	"<\|im_start\|>assistant",
	# Additional useful tokens
	"<\|im_start\|>tool",
	"<\|im_start\|>function",
	# Reasoning block
	"<\|im_start\|>thought",
	# Tool calls
	"<\|tool_call\|>",
	"<\|tool_response\|>",
	# Multimodality and audio-visual block
	"<\|image\|>",
	"<\|video\|>",
	"<\|sound\|>",
	"<\|voice\|>",
	"<\|listening\|>",
	"<\|vision\|>",
	# Emotional state (Mood)
	"<\|mood_happy\|>",
	"<\|mood_sad\|>",
	"<\|mood_angry\|>",
	"<\|mood_neutral\|>",
	# --- FIM (Fill-in-the-Middle) tokens from StarCoder ---
	"<\|fim_prefix\|>",
	"<\|fim_middle\|>",
	"<\|fim_suffix\|>",
	# Robotics (trajectory/command boundaries)
	"<\|action_start\|>",
	"<\|action_end\|>",


	# Highlights
	- Architecture: Byte-Level BPE (Byte-level Byte Pair Encoding) which natively prevents Out-Of-Vocabulary (OOV) tokens.
	- Form Factor: Fully wrapped into Hugging Face `PreTrainedTokenizerFast` with native `ByteLevel` decoders for clean cyrillic representation.
	- Chat Standard: Out-of-the-box support for ChatML formatting (`<\|im_start\|>` / `<\|im_end\|>`).
	- Domain Specialization: Pre-baked atomic routing tokens like `__CODING__` and `__PYTHON__` etc.

	## Usage

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")

	text = "<\|im_start\|>user\n__CODING__ __PYTHON__ Write a merge sort function.<\|im_end\|>"
	print(tokenizer.encode(text))


	### Benchmark for tokens quality .


	```bash
	=== Text after ChatML Template ===
	<\|im_start\|>system
	You are a precise router model.<\|im_end\|>
	<\|im_start\|>user
	__CODING__ __PYTHON__ Write a merge sort function in Python.<\|im_end\|>


	=== Tokens (IDs) ===
	[5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]

	=== Decoding Token by Token ===
	[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
	5 -> '<\|im_start\|>system'
	326 -> '
	'
	5095 -> 'You'
	944 -> ' are'
	396 -> ' a'
	23348 -> ' precise'
	1021 -> ' ro'
	7831 -> 'uter'
	5869 -> ' model'
	141 -> '.'
	4 -> '<\|im_end\|>'
	326 -> '
	'
	6 -> '<\|im_start\|>user'
	326 -> '
	'
	29 -> '__CODING__'
	348 -> ' '
	44 -> '__PYTHON__'
	26876 -> ' Write'
	396 -> ' a'
	52698 -> ' merge'
	6521 -> ' sort'
	2031 -> ' function'
	460 -> ' in'
	7524 -> ' Python'
	141 -> '.'
	4 -> '<\|im_end\|>'
	326 -> '
	'