---
license: apache-2.0
language:
- multilingual
tags:
- tokenizer
- bpe
- byte-level-bpe
- chatml
- routing
- moe
---
Enjoy! We are extending the JiRack Models Ecosystem 🚀
- RAG with Mixture-of-Experts: the CMS Manhattan RAG System with JiRack Ternary Experts, a CPU-oriented solution.
- JiRackTernary_1b model: https://huggingface.co/kgrabko/JiRackTernary_1b
# JiRack Router Tokenizer Pro
**High-performance custom Byte-Level BPE tokenizer** trained on the full Wikipedia dump across multiple languages.
Developed specifically for intelligent routing models and Mixture-of-Experts systems.
## Model Details
- **Developer**: Konstantin Grabko (JiRack)
- **License**: Apache License 2.0
- **Training Data**: Full Wikipedia multilingual dump
- **Vocabulary Size**: 65,536 tokens
- **Special Tokens**: 128 reserved tokens (including 40+ domain routing tokens)
## Key Features
- Correctly placed `<|unk|>` token at ID 0
- Full native support for **ChatML** format (`<|im_start|>` / `<|im_end|>`)
- Large set of specialized routing tokens (`__CODING__`, `__MATH__`, `__PYTHON__`, `__SCIENCE__`, etc.)
- Support for JiRack Robotics Technology via action tags (`<|action_start|>` / `<|action_end|>`)
- Support for JiRack vision, images, and audio-visual content
- Strong multilingual performance
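
As a minimal sketch of how the ChatML markers and routing tokens listed above might be combined into a prompt, the helper below builds the same layout as the sample in this README's benchmark output. The function names `render_turn` and `build_prompt` are hypothetical, and prepending routing tokens to the user turn, separated by single spaces, is an assumption taken from that sample rather than a documented requirement.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_turn(role, content, routing_tags=()):
    """Render one ChatML turn; optional routing tags are placed before the content."""
    prefix = " ".join(routing_tags)
    body = f"{prefix} {content}" if prefix else content
    return f"{IM_START}{role}\n{body}{IM_END}\n"


def build_prompt(system, user, routing_tags=()):
    """Hypothetical helper: a system turn followed by a routed user turn."""
    return render_turn("system", system) + render_turn("user", user, routing_tags)


prompt = build_prompt(
    "You are a precise router model.",
    "Write a merge sort function in Python.",
    routing_tags=("__CODING__", "__PYTHON__"),
)
print(prompt)
```

The resulting string can then be passed to the tokenizer's `encode` method in the same way as the text in the Usage section.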
## Special Tokens
- **Basic system and dialogue tokens**: `<|unk|>`, `<|endoftext|>`, `<|padding|>`, `<|im_start|>`, `<|im_end|>`
- **Core roles**: `<|im_start|>system`, `<|im_start|>user`, `<|im_start|>assistant`
- **Additional useful tokens**: `<|im_start|>tool`, `<|im_start|>function`
- **Reasoning block**: `<|im_start|>thought`
- **Tool calls**: `<|tool_call|>`, `<|tool_response|>`
- **Multimodality and audio-visual block**: `<|image|>`, `<|video|>`, `<|sound|>`, `<|voice|>`, `<|listening|>`, `<|vision|>`
- **Emotional state (mood)**: `<|mood_happy|>`, `<|mood_sad|>`, `<|mood_angry|>`, `<|mood_neutral|>`
- **FIM (Fill-in-the-Middle) tokens from StarCoder**: `<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`
- **Robotics (trajectory/command boundaries)**: `<|action_start|>`, `<|action_end|>`
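
Several of the tokens above share a prefix: `<|im_start|>` is a token on its own, and so is `<|im_start|>system`. The benchmark output in this README decodes `<|im_start|>system` as a single ID, which suggests the longer token is preferred. The sketch below illustrates that longest-match behavior with a plain regular expression over a small subset of the tokens. It is an illustration only, not the implementation used by the `tokenizers` library, and `split_on_special` is a hypothetical helper name.

```python
import re

# A small subset of the special tokens, for illustration only.
SPECIAL_TOKENS = [
    "<|im_start|>",
    "<|im_end|>",
    "<|im_start|>system",
    "<|im_start|>user",
    "<|im_start|>assistant",
    "__CODING__",
    "__PYTHON__",
]

# Sort longest first so "<|im_start|>user" is tried before "<|im_start|>".
_ordered = sorted(SPECIAL_TOKENS, key=len, reverse=True)
_pattern = re.compile("(" + "|".join(re.escape(tok) for tok in _ordered) + ")")


def split_on_special(text):
    """Split text into special tokens and the plain-text spans between them."""
    return [piece for piece in _pattern.split(text) if piece]


pieces = split_on_special(
    "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
)
print(pieces)
```

In a real tokenizer, only the plain-text spans would go through the BPE merges; each special token maps directly to its reserved ID.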
## Highlights
- **Architecture**: Byte-Level BPE (byte-level Byte Pair Encoding), which avoids Out-Of-Vocabulary (OOV) tokens because any input can be represented as a sequence of bytes.
- **Form Factor**: Fully wrapped in Hugging Face `PreTrainedTokenizerFast` with a native `ByteLevel` decoder for clean Cyrillic output.
- **Chat Standard**: Out-of-the-box support for **ChatML** formatting (`<|im_start|>` / `<|im_end|>`).
- **Domain Specialization**: Built-in atomic routing tokens such as `__CODING__` and `__PYTHON__`.
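
The no-OOV property of byte-level BPE follows from its base alphabet: every string is first reduced to UTF-8 bytes, so at most 256 base symbols are needed before any merges are applied. The sketch below shows only that first step. The learned merges of this tokenizer are not reproduced, and `to_byte_symbols` is a hypothetical helper name.

```python
def to_byte_symbols(text):
    """Map a string to its UTF-8 byte values, the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))


for sample in ["merge sort", "Привет", "你好", "🚀"]:
    symbols = to_byte_symbols(sample)
    # Every symbol fits in the 256-entry byte alphabet, whatever the script.
    assert all(0 <= s < 256 for s in symbols)
    # The mapping is lossless: the bytes decode back to the original text.
    assert bytes(symbols).decode("utf-8") == sample
    print(f"{sample!r}: {len(sample)} chars -> {len(symbols)} bytes")
```

BPE merges then combine frequent byte sequences into longer tokens, which is why common words cost one token while rare scripts fall back to several byte-level pieces instead of an unknown token.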
## Usage
```python
from transformers import AutoTokenizer
# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")
# A ChatML user turn prefixed with two routing tokens.
text = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
print(tokenizer.encode(text))
```
### Token quality benchmark
```text
=== Text after ChatML Template ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>
=== Tokens (IDs) ===
[5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]
=== Decoding Token by Token ===
[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
5 -> '<|im_start|>system'
326 -> '
'
5095 -> 'You'
944 -> ' are'
396 -> ' a'
23348 -> ' precise'
1021 -> ' ro'
7831 -> 'uter'
5869 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
6 -> '<|im_start|>user'
326 -> '
'
29 -> '__CODING__'
348 -> ' '
44 -> '__PYTHON__'
26876 -> ' Write'
396 -> ' a'
52698 -> ' merge'
6521 -> ' sort'
2031 -> ' function'
460 -> ' in'
7524 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
```