JiRack Router Tokenizer Pro

High-performance custom Byte-Level BPE tokenizer trained on the full Wikipedia dump across multiple languages.

Developed specifically for intelligent routing models and Mixture-of-Experts systems.

Model Details

Developer: Konstantin Grabko (JiRack)
License: Apache License 2.0
Training Data: Full Wikipedia multilingual dump
Vocabulary Size: 65,536 tokens
Special Tokens: 128 reserved tokens (including 40+ domain routing tokens)
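
A vocabulary of 65,536 entries is exactly 2^16, so, assuming token IDs run from 0 to 65,535, every ID fits in an unsigned 16-bit integer. The sketch below packs IDs into a compact buffer with the standard library only. The helper names `pack_ids` and `unpack_ids` are hypothetical and not part of the tokenizer.

```python
from array import array

VOCAB_SIZE = 65_536  # from the model card; equals 2**16


def pack_ids(ids):
    """Pack token IDs into bytes using unsigned 16-bit slots (hypothetical helper)."""
    ids = list(ids)
    if any(not 0 <= i < VOCAB_SIZE for i in ids):
        raise ValueError("token ID outside the assumed 0..65535 range")
    # "H" is an unsigned short, which is 2 bytes on common platforms.
    return array("H", ids).tobytes()


def unpack_ids(buf):
    """Inverse of pack_ids: recover the list of token IDs."""
    out = array("H")
    out.frombytes(buf)
    return out.tolist()


ids = [5, 326, 5095, 52698, 4]
packed = pack_ids(ids)
print(unpack_ids(packed) == ids)  # True
```

Storing IDs this way uses half the space of 32-bit integers, which can matter for large pre-tokenized corpora.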

Key Features

Correctly placed <|unk|> token at ID 0
Full native support for ChatML format (<|im_start|> / <|im_end|>)
Large set of specialized routing tokens (__CODING__, __MATH__, __PYTHON__, __SCIENCE__, etc.)
Support for JiRack Robotics Technology through action tags (<|action_start|> / <|action_end|>)
Support for JiRack vision, image, and audio-visual tokens
Strong multilingual performance
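
As an illustration of how routing tokens could be used downstream, the sketch below pulls routing tags out of a prompt with a regular expression. It assumes every tag is an uppercase word wrapped in double underscores, as in the examples above; the full tag list and its naming rules are not given here, and `extract_routing_tags` is a hypothetical helper.

```python
import re

# Assumption: routing tags look like __CODING__ or __MATH__
# (uppercase letters only, wrapped in double underscores).
ROUTING_TAG = re.compile(r"__[A-Z]+__")


def extract_routing_tags(prompt):
    """Return the routing tags found in a prompt, in order of appearance."""
    return ROUTING_TAG.findall(prompt)


prompt = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
print(extract_routing_tags(prompt))  # ['__CODING__', '__PYTHON__']
```

With the tokenizer itself, each tag is a single token (the benchmark output on this card shows ID 29 for __CODING__ and ID 44 for __PYTHON__), so a router could match on IDs instead of text.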

Basic system and dialogue tokens

"<|unk|>", "<|endoftext|>", "<|padding|>", "<|im_start|>", "<|im_end|>",

Core roles

Additional useful tokens

"<|im_start|>tool", "<|im_start|>function",

Reasoning block

"<|im_start|>thought",

Tool calls

"<|tool_call|>", "<|tool_response|>",

Multimodality and audio-visual block

"<|image|>", "<|video|>", "<|sound|>", "<|voice|>", "<|listening|>", "<|vision|>",

Emotional state (Mood)

"<|mood_happy|>", "<|mood_sad|>", "<|mood_angry|>", "<|mood_neutral|>",

FIM (Fill-in-the-Middle) tokens from StarCoder

"<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
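
A minimal sketch of how these tokens might be combined into a fill-in-the-middle prompt. It assumes the prefix-suffix-middle ordering used by StarCoder; this card does not state which ordering a downstream model expects, and `build_fim_prompt` is a hypothetical helper.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix, suffix):
    """Assemble a prompt in prefix-suffix-middle order (assumed, StarCoder-style).

    The model is expected to generate the missing middle after FIM_MIDDLE.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))")
print(prompt)
```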

Robotics (trajectory/command boundaries)

"<|action_start|>", "<|action_end|>",
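
Since the action tags mark trajectory or command boundaries, generated text can be scanned for the spans between them. The sketch below does this with a regular expression. The payload shown (`move_arm x=0.10 y=0.25`) is invented for illustration, because the command format is not specified here, and `extract_actions` is a hypothetical helper.

```python
import re

# Non-greedy match so that several action blocks in one string stay separate.
ACTION_BLOCK = re.compile(r"<\|action_start\|>(.*?)<\|action_end\|>", re.DOTALL)


def extract_actions(text):
    """Return the payload of every <|action_start|>...<|action_end|> block."""
    return [payload.strip() for payload in ACTION_BLOCK.findall(text)]


# Hypothetical model output; the command syntax is an assumption.
sample = "Moving now. <|action_start|>move_arm x=0.10 y=0.25<|action_end|> Done."
print(extract_actions(sample))  # ['move_arm x=0.10 y=0.25']
```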

Highlights

Architecture: Byte-level BPE (byte-level Byte Pair Encoding). Because any input can fall back to raw bytes, no text is out-of-vocabulary (OOV).
Form Factor: Packaged as a Hugging Face PreTrainedTokenizerFast with a native ByteLevel decoder, so Cyrillic text decodes cleanly.
Chat Standard: Out-of-the-box support for ChatML formatting (<|im_start|> / <|im_end|>).
Domain Specialization: Built-in atomic routing tokens such as __CODING__ and __PYTHON__, each encoded as a single token.
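
The OOV claim follows from the base alphabet: byte-level BPE starts from the 256 possible byte values, and any Unicode string reduces to UTF-8 bytes. The sketch below shows only that reduction and its round trip for Cyrillic text and an emoji. It does not run the tokenizer's learned merges, and the helper names are hypothetical.

```python
def to_base_symbols(text):
    """Map text to UTF-8 byte values, the 256-symbol base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))


def from_base_symbols(symbols):
    """Rebuild the original text from its byte values."""
    return bytes(symbols).decode("utf-8")


text = "Привет, мир! 🤖"
symbols = to_base_symbols(text)
print(all(0 <= s <= 255 for s in symbols))  # True: every symbol is a single byte
print(from_base_symbols(symbols) == text)   # True: lossless round trip
```

BPE merges then combine frequent byte sequences into larger tokens, so common words cost one or two IDs while rare strings still encode, just with more tokens.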

Usage

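from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")

# A ChatML user turn prefixed with two routing tokens.
text = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
# Print the resulting list of token IDs.
print(tokenizer.encode(text))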
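
For multi-turn input, the ChatML string can be assembled from a list of messages before encoding. The sketch below is a plain-Python formatter that reproduces the layout of the benchmark text on this card; `format_chatml` is a hypothetical helper. If the tokenizer ships a chat template, `tokenizer.apply_chat_template` would be the usual route instead.

```python
def format_chatml(messages):
    """Render messages as ChatML: one <|im_start|>role ... <|im_end|> block per message."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


messages = [
    {"role": "system", "content": "You are a precise router model."},
    {"role": "user", "content": "__CODING__ __PYTHON__ Write a merge sort function in Python."},
]
text = format_chatml(messages)
print(text)
```

The resulting string can be passed to `tokenizer.encode` exactly as in the usage snippet above.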


### Token quality benchmark


```text
=== Text after ChatML Template ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>


=== Tokens (IDs) ===
[5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]

=== Decoding Token by Token ===
[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
5 -> '<|im_start|>system'
326 -> '\n'
5095 -> 'You'
944 -> ' are'
396 -> ' a'
23348 -> ' precise'
1021 -> ' ro'
7831 -> 'uter'
5869 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
6 -> '<|im_start|>user'
326 -> '\n'
29 -> '__CODING__'
348 -> ' '
44 -> '__PYTHON__'
26876 -> ' Write'
396 -> ' a'
52698 -> ' merge'
6521 -> ' sort'
2031 -> ' function'
460 -> ' in'
7524 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
```
