---
license: apache-2.0
language:
- multilingual
tags:
- tokenizer
- bpe
- byte-level-bpe
- chatml
- routing
- moe
---
Enjoy! We are extending the JiRack Models Ecosystem. 🚀
- RAG with Mixture-of-Experts: the CMS Manhattan RAG System with JiRack Ternary Experts, a CPU solution.
- JiRackTernary_1b model: https://huggingface.co/kgrabko/JiRackTernary_1b
# JiRack Router Tokenizer Pro
**High-performance custom Byte-Level BPE tokenizer** trained on the full Wikipedia dump across multiple languages.
Developed specifically for intelligent routing models and Mixture-of-Experts systems.
## Model Details
- **Developer**: Konstantin Grabko (JiRack)
- **License**: Apache License 2.0
- **Training Data**: Full Wikipedia multilingual dump
- **Vocabulary Size**: 65,536 tokens
- **Special Tokens**: 128 reserved tokens (including 40+ domain routing tokens)
## Key Features
- Correctly placed `<|unk|>` token at ID 0
- Full native support for **ChatML** format (`<|im_start|>` / `<|im_end|>`)
- Large set of specialized routing tokens (`__CODING__`, `__MATH__`, `__PYTHON__`, `__SCIENCE__`, etc.)
- Supports JiRack Robotics Technology through action tags (`<|action_start|>` / `<|action_end|>`)
- Supports JiRack vision, image, and audio-visual tokens
- Strong multilingual performance
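
As a rough illustration of how the ChatML markers and routing tokens combine, the sketch below assembles a prompt as a plain string. `build_chatml_prompt` and its `routing_tags` parameter are hypothetical helpers introduced here for illustration; they are not part of the tokenizer or of `transformers`.

```python
# Sketch only: build_chatml_prompt is a hypothetical helper, not part of
# the tokenizer package. It formats strings and does no tokenization.

def build_chatml_prompt(messages, routing_tags=()):
    """Render (role, content) pairs as ChatML text.

    Routing tags, if given, are prepended to every user turn,
    separated by single spaces.
    """
    parts = []
    for role, content in messages:
        if role == "user" and routing_tags:
            content = " ".join(routing_tags) + " " + content
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [
        ("system", "You are a precise router model."),
        ("user", "Write a merge sort function in Python."),
    ],
    routing_tags=("__CODING__", "__PYTHON__"),
)
print(prompt)
```

The resulting string can then be passed to `tokenizer.encode` in the usual way.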
## Special Tokens
- **Basic system and dialogue tokens**: `<|unk|>`, `<|endoftext|>`, `<|padding|>`, `<|im_start|>`, `<|im_end|>`
- **Core roles**: `<|im_start|>system`, `<|im_start|>user`, `<|im_start|>assistant`
- **Additional useful tokens**: `<|im_start|>tool`, `<|im_start|>function`
- **Reasoning block**: `<|im_start|>thought`
- **Tool calls**: `<|tool_call|>`, `<|tool_response|>`
- **Multimodality and audio-visual block**: `<|image|>`, `<|video|>`, `<|sound|>`, `<|voice|>`, `<|listening|>`, `<|vision|>`
- **Emotional state (mood)**: `<|mood_happy|>`, `<|mood_sad|>`, `<|mood_angry|>`, `<|mood_neutral|>`
- **FIM (Fill-in-the-Middle) tokens from StarCoder**: `<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`
- **Robotics (trajectory/command boundaries)**: `<|action_start|>`, `<|action_end|>`
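
The sketch below maps the special tokens listed above to IDs, assuming IDs are assigned in the order shown. That ordering is an assumption: this card only confirms `<|unk|>` at ID 0 and, from the benchmark output, IDs 4, 5, and 6. The names `SPECIAL_TOKENS` and `TOKEN_TO_ID` are illustrative and do not come from the tokenizer package.

```python
# Sketch only: assumes special-token IDs follow the listing order.
# Confirmed by this card: <|unk|> = 0, <|im_end|> = 4,
# <|im_start|>system = 5, <|im_start|>user = 6. All other IDs are assumed.
SPECIAL_TOKENS = [
    "<|unk|>", "<|endoftext|>", "<|padding|>", "<|im_start|>", "<|im_end|>",
    "<|im_start|>system", "<|im_start|>user", "<|im_start|>assistant",
    "<|im_start|>tool", "<|im_start|>function", "<|im_start|>thought",
    "<|tool_call|>", "<|tool_response|>",
    "<|image|>", "<|video|>", "<|sound|>", "<|voice|>", "<|listening|>", "<|vision|>",
    "<|mood_happy|>", "<|mood_sad|>", "<|mood_angry|>", "<|mood_neutral|>",
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    "<|action_start|>", "<|action_end|>",
]

# Position in the list doubles as the (assumed) token ID.
TOKEN_TO_ID = {token: index for index, token in enumerate(SPECIAL_TOKENS)}

for token in ("<|unk|>", "<|im_end|>", "<|im_start|>system", "<|im_start|>user"):
    print(token, "->", TOKEN_TO_ID[token])
```

To check the real values, compare against `tokenizer.convert_tokens_to_ids` on the loaded tokenizer.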
## Highlights
- **Architecture**: Byte-Level BPE (Byte-level Byte Pair Encoding) which natively prevents Out-Of-Vocabulary (OOV) tokens.
- **Form Factor**: Wrapped in Hugging Face `PreTrainedTokenizerFast` with native `ByteLevel` decoders for clean Cyrillic output.
- **Chat Standard**: Out-of-the-box support for **ChatML** formatting (`<|im_start|>` / `<|im_end|>`).
- **Domain Specialization**: Pre-baked atomic routing tokens such as `__CODING__` and `__PYTHON__`.
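
To see why a byte-level alphabet rules out OOV inputs, note that any Unicode string reduces to UTF-8 bytes, and there are only 256 possible byte values. The sketch below shows that principle in plain Python. It is not the tokenizer's implementation, and `to_byte_symbols` is a hypothetical helper name.

```python
# Sketch of the principle behind byte-level BPE, not the actual tokenizer:
# every string decomposes into UTF-8 bytes, so a 256-symbol base alphabet
# can represent any input before merges are applied.

def to_byte_symbols(text):
    """Return the UTF-8 byte values of text; each value is in range(256)."""
    return list(text.encode("utf-8"))


samples = ["merge sort", "Привет, мир", "你好", "🚀"]
for sample in samples:
    symbols = to_byte_symbols(sample)
    # Every symbol fits the 256-entry base alphabet.
    assert all(0 <= value < 256 for value in symbols)
    # The byte sequence round-trips back to the original text.
    assert bytes(symbols).decode("utf-8") == sample
    print(f"{sample!r}: {len(sample)} chars -> {len(symbols)} bytes")
```

BPE merges then combine frequent byte sequences into longer tokens, which is why common words can end up as single IDs while rare text falls back to shorter pieces.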
## Usage
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")
text = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
print(tokenizer.encode(text))
```
### Benchmark for token quality
```text
=== Text after ChatML Template ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>
=== Tokens (IDs) ===
[5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]
=== Decoding Token by Token ===
[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
5 -> '<|im_start|>system'
326 -> '
'
5095 -> 'You'
944 -> ' are'
396 -> ' a'
23348 -> ' precise'
1021 -> ' ro'
7831 -> 'uter'
5869 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
6 -> '<|im_start|>user'
326 -> '
'
29 -> '__CODING__'
348 -> ' '
44 -> '__PYTHON__'
26876 -> ' Write'
396 -> ' a'
52698 -> ' merge'
6521 -> ' sort'
2031 -> ' function'
460 -> ' in'
7524 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
```