---
license: apache-2.0
language:
- multilingual
tags:
- tokenizer
- bpe
- byte-level-bpe
- chatml
- routing
- moe
---

Enjoy! We are extending the JiRack Models Ecosystem. 🚀
- CMS Manhattan RAG System: RAG with Mixture-of-Experts, built on JiRack Ternary Experts as a CPU solution.
- JiRackTernary_1b model: https://huggingface.co/kgrabko/JiRackTernary_1b

# JiRack Router Tokenizer Pro

**High-performance custom Byte-Level BPE tokenizer** trained on the full Wikipedia dump across multiple languages.

It was developed specifically for intelligent routing models and Mixture-of-Experts (MoE) systems.

## Model Details

- **Developer**: Konstantin Grabko (JiRack)
- **License**: Apache License 2.0
- **Training Data**: Full multilingual Wikipedia dump
- **Vocabulary Size**: 65,536 tokens
- **Special Tokens**: 128 reserved tokens (including 40+ domain routing tokens)

## Key Features

- `<|unk|>` token placed at ID 0
- Full native support for the **ChatML** format (`<|im_start|>` / `<|im_end|>`)
- Large set of specialized routing tokens (`__CODING__`, `__MATH__`, `__PYTHON__`, `__SCIENCE__`, etc.)
- Support for JiRack Robotics Technology via action tags (`<|action_start|>` / `<|action_end|>`)
- Support for JiRack vision, image, and audio-visual tokens
- Strong multilingual performance

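As a minimal sketch of how the ChatML markers and routing tokens listed above might be combined into a prompt, the helper below builds the text with plain string formatting. The function names `chatml_turn` and `routed_prompt` are hypothetical, and placing space-separated routing tokens at the start of the user message is an assumption based on the sample prompt in this card, not a documented requirement.

```python
from typing import Iterable

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def chatml_turn(role: str, content: str) -> str:
    # One ChatML turn: start marker + role, newline, content, end marker, newline
    return f"{IM_START}{role}\n{content}{IM_END}\n"


def routed_prompt(system: str, user: str, routing_tokens: Iterable[str] = ()) -> str:
    # Assumption: routing tokens are space-separated and prepended to the user message
    prefix = " ".join(routing_tokens)
    user_content = f"{prefix} {user}" if prefix else user
    return chatml_turn("system", system) + chatml_turn("user", user_content)


prompt = routed_prompt(
    "You are a precise router model.",
    "Write a merge sort function in Python.",
    routing_tokens=["__CODING__", "__PYTHON__"],
)
print(prompt)
```

The resulting string can be passed to `tokenizer.encode` as-is. No tokenizer is needed to build it.
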
## Special Tokens

An excerpt of the reserved special tokens, grouped by purpose:

- **Basic system and dialogue tokens**: `<|unk|>`, `<|endoftext|>`, `<|padding|>`, `<|im_start|>`, `<|im_end|>`
- **Core roles**: `<|im_start|>system`, `<|im_start|>user`, `<|im_start|>assistant`
- **Additional useful tokens**: `<|im_start|>tool`, `<|im_start|>function`
- **Reasoning block**: `<|im_start|>thought`
- **Tool calls**: `<|tool_call|>`, `<|tool_response|>`
- **Multimodality and audio-visual block**: `<|image|>`, `<|video|>`, `<|sound|>`, `<|voice|>`, `<|listening|>`, `<|vision|>`
- **Emotional state (mood)**: `<|mood_happy|>`, `<|mood_sad|>`, `<|mood_angry|>`, `<|mood_neutral|>`
- **FIM (Fill-in-the-Middle) tokens from StarCoder**: `<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`
- **Robotics (trajectory/command boundaries)**: `<|action_start|>`, `<|action_end|>`

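The FIM tokens above could be used to lay out a fill-in-the-middle prompt along the following lines. This is only a sketch: the prefix-suffix-middle ordering is an assumption borrowed from the StarCoder convention, the helper name `build_fim_prompt` is hypothetical, and whether a given JiRack model was trained on this layout is not stated in this card.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    # Assumed ordering: code before the gap, code after the gap, then the
    # middle marker after which a model would generate the missing span
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


fim_prompt = build_fim_prompt(
    "def add(a, b):\n    return ",
    "\n\nprint(add(1, 2))\n",
)
print(fim_prompt)
```
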
## Highlights

- **Architecture**: Byte-Level BPE (byte-level Byte Pair Encoding), which natively prevents out-of-vocabulary (OOV) tokens because any input can be represented as a sequence of bytes.
- **Form Factor**: Fully wrapped in Hugging Face `PreTrainedTokenizerFast` with a native `ByteLevel` decoder for clean Cyrillic output.
- **Chat Standard**: Out-of-the-box support for **ChatML** formatting (`<|im_start|>` / `<|im_end|>`).
- **Domain Specialization**: Pre-baked atomic routing tokens such as `__CODING__` and `__PYTHON__`.

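The OOV claim above rests on a simple property: any Unicode string can be written as UTF-8 bytes, and there are only 256 possible byte values to cover. The sketch below illustrates that property with the standard library alone. It does not reproduce the tokenizer's merges, and `to_byte_symbols` is a hypothetical helper name.

```python
def to_byte_symbols(text: str) -> list[int]:
    # Every string maps to UTF-8 bytes, each in the range 0..255
    return list(text.encode("utf-8"))


samples = ["merge sort", "Привет", "日本語", "🚀"]
for sample in samples:
    symbols = to_byte_symbols(sample)
    # All symbols fall inside the 256-value byte alphabet, so nothing is "unknown"
    assert all(0 <= b <= 255 for b in symbols)
    # The byte sequence decodes back to the original text without loss
    assert bytes(symbols).decode("utf-8") == sample
    print(f"{sample!r}: {len(symbols)} bytes")
```

A byte-level BPE vocabulary starts from these byte symbols and merges frequent sequences into longer tokens, so rare scripts and emoji cost more tokens but are still representable.
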
## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")

# ChatML-formatted user turn prefixed with routing tokens
text = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
print(tokenizer.encode(text))
```

### Token quality benchmark

```text
=== Text after ChatML Template ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>


=== Tokens (IDs) ===
[5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]

=== Decoding Token by Token ===
[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
5 -> '<|im_start|>system'
326 -> '
'
5095 -> 'You'
944 -> ' are'
396 -> ' a'
23348 -> ' precise'
1021 -> ' ro'
7831 -> 'uter'
5869 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
6 -> '<|im_start|>user'
326 -> '
'
29 -> '__CODING__'
348 -> ' '
44 -> '__PYTHON__'
26876 -> ' Write'
396 -> ' a'
52698 -> ' merge'
6521 -> ' sort'
2031 -> ' function'
460 -> ' in'
7524 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
```