Enjoy! We extend the JiRack Models Ecosystem!
- The best RAG with Mixture-of-Experts: the CMS Manhattan RAG System with JiRack Ternary Experts, a solution for CPU.
- JiRackTernary_1b model: https://huggingface.co/kgrabko/JiRackTernary_1b
## JiRack Router Tokenizer Pro
High-performance custom Byte-Level BPE tokenizer trained on the full Wikipedia dump across multiple languages.
Developed specifically for intelligent routing models and Mixture-of-Experts systems.
### Model Details
- Developer: Konstantin Grabko (JiRack)
- License: Apache License 2.0
- Training Data: Full multilingual Wikipedia dump
- Vocabulary Size: 65,536 tokens
- Special Tokens: 128 reserved tokens (including 40+ domain routing tokens)
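
The layout of the reserved special tokens can be sketched as a plain list. The sketch below assumes that IDs are assigned in the order the tokens are listed further down in this card. Only `<|unk|>` at ID 0 is stated explicitly; IDs 4, 5, and 6 match the sample output in the benchmark section, and every other position is an assumption. The list covers only the 28 tokens the card names, not the full set of 128 reserved tokens.

```python
# Special tokens in the order this card lists them (28 of the 128 reserved tokens).
# Assumption: IDs follow list order; only a few positions are confirmed by the card.
SPECIAL_TOKENS = [
    # Basic system and dialogue tokens
    "<|unk|>", "<|endoftext|>", "<|padding|>", "<|im_start|>", "<|im_end|>",
    # Core roles
    "<|im_start|>system", "<|im_start|>user", "<|im_start|>assistant",
    # Additional useful tokens
    "<|im_start|>tool", "<|im_start|>function",
    # Reasoning block
    "<|im_start|>thought",
    # Tool calls
    "<|tool_call|>", "<|tool_response|>",
    # Multimodality and audio-visual block
    "<|image|>", "<|video|>", "<|sound|>", "<|voice|>", "<|listening|>", "<|vision|>",
    # Emotional state (mood)
    "<|mood_happy|>", "<|mood_sad|>", "<|mood_angry|>", "<|mood_neutral|>",
    # FIM (Fill-in-the-Middle) tokens
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    # Robotics (trajectory/command boundaries)
    "<|action_start|>", "<|action_end|>",
]

# Hypothetical lookup table built from the assumed ordering.
token_to_id = {token: index for index, token in enumerate(SPECIAL_TOKENS)}

# Positions the card itself confirms.
CONFIRMED = {
    "<|unk|>": 0,
    "<|im_end|>": 4,
    "<|im_start|>system": 5,
    "<|im_start|>user": 6,
}
mismatches = {t: i for t, i in CONFIRMED.items() if token_to_id[t] != i}
print("mismatches:", mismatches)
```

To check the remaining positions against the real vocabulary, compare this table with `tokenizer.convert_tokens_to_ids(...)` on the loaded tokenizer.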
### Key Features
- Correctly placed `<|unk|>` token at ID 0
- Full native support for ChatML format (`<|im_start|>` / `<|im_end|>`)
- Large set of specialized routing tokens (`__CODING__`, `__MATH__`, `__PYTHON__`, `__SCIENCE__`, etc.)
- Support for JiRack Robotics Technology with tags (`<|action_start|>` / `<|action_end|>`)
- Support for JiRack vision, images, and audio-visual content
- Strong multilingual performance
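
As a rough illustration of how a router might consume the routing tokens, the sketch below pulls tags out of a prompt and maps them to experts. Everything here is hypothetical: the expert names, the `EXPERTS` table, and the helper functions are made up for the example, and only the four tags named above are covered. A real router would more likely work on token IDs than on raw text.

```python
import re

# Hypothetical tag-to-expert table; the card does not define expert names.
EXPERTS = {
    "__CODING__": "code-expert",
    "__PYTHON__": "code-expert",
    "__MATH__": "math-expert",
    "__SCIENCE__": "science-expert",
}

# Matches tags written as two underscores, capital letters, two underscores.
TAG_PATTERN = re.compile(r"__[A-Z]+__")


def extract_routing_tags(text):
    """Return the known routing tags in order of first appearance."""
    tags = []
    for tag in TAG_PATTERN.findall(text):
        if tag in EXPERTS and tag not in tags:
            tags.append(tag)
    return tags


def pick_experts(text, default="general-expert"):
    """Map the tags found in the text to experts, without duplicates."""
    experts = []
    for tag in extract_routing_tags(text):
        expert = EXPERTS[tag]
        if expert not in experts:
            experts.append(expert)
    # Fall back to a default expert when the prompt carries no known tag.
    return experts or [default]


prompt = "__CODING__ __PYTHON__ Write a merge sort function."
print(extract_routing_tags(prompt))
print(pick_experts(prompt))
```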
Basic system and dialogue tokens
"<|unk|>", "<|endoftext|>", "<|padding|>", "<|im_start|>", "<|im_end|>",
Core roles
"<|im_start|>system", "<|im_start|>user", "<|im_start|>assistant",
Additional useful tokens
"<|im_start|>tool", "<|im_start|>function",
Reasoning block
"<|im_start|>thought",
Tool calls
"<|tool_call|>", "<|tool_response|>",
Multimodality and audio-visual block
"<|image|>", "<|video|>", "<|sound|>", "<|voice|>", "<|listening|>", "<|vision|>",
Emotional state (Mood)
"<|mood_happy|>", "<|mood_sad|>", "<|mood_angry|>", "<|mood_neutral|>",
FIM (Fill-in-the-Middle) tokens from StarCoder
"<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
Robotics (trajectory/command boundaries)
"<|action_start|>", "<|action_end|>",
### Highlights
- Architecture: Byte-Level BPE (byte-level Byte Pair Encoding), which natively prevents Out-Of-Vocabulary (OOV) tokens.
- Form Factor: Fully wrapped into Hugging Face `PreTrainedTokenizerFast` with native `ByteLevel` decoders for clean Cyrillic representation.
- Chat Standard: Out-of-the-box support for ChatML formatting (`<|im_start|>` / `<|im_end|>`).
- Domain Specialization: Pre-baked atomic routing tokens such as `__CODING__` and `__PYTHON__`.
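
Why byte-level BPE avoids OOV tokens can be shown with a small sketch: every string is first turned into UTF-8 bytes, and each of the 256 possible byte values has its own base symbol, so any input is representable before merges are applied. The table below follows the GPT-2 style byte-to-character mapping; that this tokenizer uses exactly this table is an assumption, and the helper names are made up for the example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character (GPT-2 style)."""
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))  # U+00AE .. U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    extra = 0
    for b in range(256):
        if b not in printable:
            # Control bytes, space, etc. get shifted to code points above 255.
            byte_values.append(b)
            code_points.append(256 + extra)
            extra += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_symbols(text):
    """Encode text as UTF-8 and replace every byte with its base symbol."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_symbols(symbols):
    """Invert to_byte_symbols: symbols back to bytes, then decode as UTF-8."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


sample = "Привет, мир!"
symbols = to_byte_symbols(sample)
print(len(sample), "characters ->", len(symbols), "byte symbols")
print(from_byte_symbols(symbols) == sample)
```

Because the mapping is a bijection over all byte values, decoding the symbols back through the table restores the original text, which is what keeps Cyrillic and other non-Latin scripts intact after a round trip.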
### Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")
text = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
print(tokenizer.encode(text))
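
For multi-turn prompts, the ChatML text can be assembled with a small helper before encoding. This is a minimal sketch, assuming each turn is rendered as `<|im_start|>` plus the role, a newline, the content, `<|im_end|>`, and a trailing newline, which matches the sample text shown in the benchmark section. The helper names `with_routing_tags` and `to_chatml` are hypothetical and not part of the tokenizer.

```python
def with_routing_tags(text, tags):
    """Prefix a user message with space-separated routing tags."""
    return " ".join(list(tags) + [text])


def to_chatml(messages):
    """Render (role, content) pairs as ChatML text, one block per turn."""
    return "".join(
        f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in messages
    )


prompt = to_chatml([
    ("system", "You are a precise router model."),
    ("user", with_routing_tags(
        "Write a merge sort function in Python.",
        ["__CODING__", "__PYTHON__"],
    )),
])
print(prompt)
```

The resulting string can be passed to `tokenizer.encode(prompt)` as in the snippet above. If the tokenizer ships a chat template, `tokenizer.apply_chat_template` would be the usual route instead; the card does not say whether it does.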
### Token quality benchmark
```bash
=== Text after ChatML Template ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>
=== Tokens (IDs) ===
[5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]
=== Decoding Token by Token ===
[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
5 -> '<|im_start|>system'
326 -> '
'
5095 -> 'You'
944 -> ' are'
396 -> ' a'
23348 -> ' precise'
1021 -> ' ro'
7831 -> 'uter'
5869 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
6 -> '<|im_start|>user'
326 -> '
'
29 -> '__CODING__'
348 -> ' '
44 -> '__PYTHON__'
26876 -> ' Write'
396 -> ' a'
52698 -> ' merge'
6521 -> ' sort'
2031 -> ' function'
460 -> ' in'
7524 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
```