---
license: apache-2.0
language:
  - multilingual
tags:
  - tokenizer
  - bpe
  - byte-level-bpe
  - chatml
  - routing
  - moe
---

Enjoy! We are extending the JiRack Models Ecosystem. 🚀
- RAG with Mixture-of-Experts: the CMS Manhattan RAG System with JiRack Ternary Experts, a CPU solution.
- JiRackTernary_1b model: https://huggingface.co/kgrabko/JiRackTernary_1b

# JiRack Router Tokenizer Pro

**High-performance custom Byte-Level BPE tokenizer** trained on the full Wikipedia dump across multiple languages.

Developed specifically for intelligent routing models and Mixture-of-Experts systems.


## Model Details

- **Developer**: Konstantin Grabko (JiRack)
- **License**: Apache License 2.0
- **Training Data**: Full Wikipedia multilingual dump
- **Vocabulary Size**: 65,536 tokens
- **Special Tokens**: 128 reserved tokens (including 40+ domain routing tokens)

## Key Features

- Correctly placed `<|unk|>` token at ID 0
- Full native support for **ChatML** format (`<|im_start|>` / `<|im_end|>`)
- Large set of specialized routing tokens (`__CODING__`, `__MATH__`, `__PYTHON__`, `__SCIENCE__`, etc.)
- Support for JiRack Robotics Technology via action tags (`<|action_start|>` / `<|action_end|>`)
- Support for JiRack vision, image, and audio-visual modalities
- Strong multilingual performance

## Special Tokens

- **Basic system and dialogue tokens**: `<|unk|>`, `<|endoftext|>`, `<|padding|>`, `<|im_start|>`, `<|im_end|>`
- **Core roles**: `<|im_start|>system`, `<|im_start|>user`, `<|im_start|>assistant`
- **Additional useful tokens**: `<|im_start|>tool`, `<|im_start|>function`
- **Reasoning block**: `<|im_start|>thought`
- **Tool calls**: `<|tool_call|>`, `<|tool_response|>`
- **Multimodality and audio-visual block**: `<|image|>`, `<|video|>`, `<|sound|>`, `<|voice|>`, `<|listening|>`, `<|vision|>`
- **Emotional state (mood)**: `<|mood_happy|>`, `<|mood_sad|>`, `<|mood_angry|>`, `<|mood_neutral|>`
- **FIM (Fill-in-the-Middle) tokens from StarCoder**: `<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`
- **Robotics (trajectory/command boundaries)**: `<|action_start|>`, `<|action_end|>`
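Because role markers such as `<|im_start|>user` are listed as single tokens, a ChatML prompt can be assembled with plain string formatting before it is passed to the tokenizer. The following is a minimal sketch, not part of the tokenizer itself: `build_chatml` and `ROLES` are hypothetical names introduced for illustration, and the convention of prepending routing tags to the user message is an assumption based on the examples in this README.

```python
# Hypothetical helper for illustration; it is not shipped with the tokenizer.
# Roles mirror the "<|im_start|>..." entries in the special-token list.
ROLES = {"system", "user", "assistant", "tool", "function", "thought"}


def build_chatml(messages, routing_tags=()):
    """Join (role, content) pairs into one ChatML string.

    Routing tags (e.g. "__CODING__") are prepended to user messages only,
    which is an assumption drawn from the README examples.
    """
    parts = []
    for role, content in messages:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        if role == "user" and routing_tags:
            content = " ".join(routing_tags) + " " + content
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    return "".join(parts)


prompt = build_chatml(
    [
        ("system", "You are a precise router model."),
        ("user", "Write a merge sort function in Python."),
    ],
    routing_tags=("__CODING__", "__PYTHON__"),
)
print(prompt)
```

The resulting string can be passed directly to `tokenizer.encode`.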


## Highlights
- **Architecture**: Byte-Level BPE (byte-level Byte Pair Encoding), which natively prevents out-of-vocabulary (OOV) tokens.
- **Form Factor**: Fully wrapped in Hugging Face `PreTrainedTokenizerFast` with native `ByteLevel` decoders for clean Cyrillic representation.
- **Chat Standard**: Out-of-the-box support for **ChatML** formatting (`<|im_start|>` / `<|im_end|>`).
- **Domain Specialization**: Pre-baked atomic routing tokens such as `__CODING__` and `__PYTHON__`.
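To illustrate why a byte-level vocabulary cannot hit an out-of-vocabulary input, and why a `ByteLevel` decoder is needed to get readable Cyrillic back, here is a small standalone sketch. It assumes the tokenizer follows the GPT-2 style byte-to-character table commonly used by byte-level BPE implementations; the function names are illustrative and nothing here is taken from this tokenizer's actual files.

```python
def bytes_to_unicode():
    """GPT-2 style table: maps each of the 256 byte values to a printable character."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are shifted to code points above 255.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_symbols(text):
    """Any string becomes a sequence drawn from a fixed 256-symbol alphabet."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Inverse step, analogous to what a byte-level decoder does."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


sample = "Привет, мир! 你好 🚀"
symbols = to_byte_symbols(sample)
print(symbols)                     # looks garbled: one symbol per UTF-8 byte
print(from_byte_symbols(symbols))  # round-trips to the original text
```

Since every possible byte has a symbol in the base alphabet, BPE merges only ever shorten the sequence; they are never required for coverage.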

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")

# A ChatML user turn prefixed with two routing tokens
text = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
print(tokenizer.encode(text))
```
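To check how a prompt is split, each ID can be decoded on its own. The helper below is a sketch with a hypothetical name (`decode_token_by_token`); it relies only on the `encode` and `decode` methods of a loaded tokenizer and is not part of this repository.

```python
def decode_token_by_token(tokenizer, text):
    """Return (token_id, decoded_piece) pairs for `text`.

    `tokenizer` is any object exposing `encode(str) -> list[int]` and
    `decode(list[int]) -> str`, such as a Hugging Face tokenizer.
    """
    ids = tokenizer.encode(text)
    return [(token_id, tokenizer.decode([token_id])) for token_id in ids]


def format_pieces(pairs):
    """Render pairs as 'id -> piece' lines; repr() keeps newlines visible as '\\n'."""
    return [f"{token_id} -> {piece!r}" for token_id, piece in pairs]


# Usage, assuming `tokenizer` and `text` from the snippet above:
#     for line in format_pieces(decode_token_by_token(tokenizer, text)):
#         print(line)
```

Routing tags and role markers should each appear as a single piece in the result; ordinary words may be split into several.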


### Token quality benchmark


```text
=== Text after ChatML Template ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>


=== Tokens (IDs) ===
[5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]

=== Decoding Token by Token ===
[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
5 -> '<|im_start|>system'
326 -> '\n'
5095 -> 'You'
944 -> ' are'
396 -> ' a'
23348 -> ' precise'
1021 -> ' ro'
7831 -> 'uter'
5869 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
6 -> '<|im_start|>user'
326 -> '\n'
29 -> '__CODING__'
348 -> ' '
44 -> '__PYTHON__'
26876 -> ' Write'
396 -> ' a'
52698 -> ' merge'
6521 -> ' sort'
2031 -> ' function'
460 -> ' in'
7524 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '\n'
```