Create README.md
README.md
ADDED
@@ -0,0 +1,90 @@
---
license: apache-2.0
language:
- multilingual
tags:
- tokenizer
- bpe
- byte-level-bpe
- chatml
- routing
- moe
---

# JiRack Router Tokenizer Pro

**High-performance custom Byte-Level BPE tokenizer** trained on the full multilingual Wikipedia dump.

Developed specifically for intelligent routing models and Mixture-of-Experts (MoE) systems.

## Model Details

- **Developer**: Konstantin Grabko (JiRack)
- **License**: Apache License 2.0
- **Training Data**: Full Wikipedia multilingual dump
- **Vocabulary Size**: 65,536 tokens
- **Special Tokens**: 128 reserved tokens (including 40+ domain routing tokens)

## Key Features

- The `<|unk|>` token is placed at ID 0
- Full native support for the **ChatML** format (`<|im_start|>` / `<|im_end|>`)
- Large set of specialized routing tokens (`__CODING__`, `__MATH__`, `__PYTHON__`, `__SCIENCE__`, etc.)
- Support for JiRack Robotics Technology via action-boundary tags (`<|action_start|>` / `<|action_end|>`)
- Support for JiRack vision, image, and audio-visual tokens
- Strong multilingual performance
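
The ChatML and routing-token features can be combined when assembling a prompt. The sketch below is only an illustration and is not shipped with the tokenizer: `build_chatml_turn` and its parameters are hypothetical names, and the layout (routing tokens first, separated by spaces) is an assumption taken from the example string in the Usage section.

```python
def build_chatml_turn(role, content, routing_tokens=()):
    """Wrap one message in ChatML markers, prefixing optional routing tokens.

    Hypothetical helper: the layout mirrors the README's usage string,
    it is not an API provided by the tokenizer.
    """
    prefix = " ".join(routing_tokens)
    body = f"{prefix} {content}" if prefix else content
    return f"<|im_start|>{role}\n{body}<|im_end|>"


prompt = build_chatml_turn(
    "user",
    "Write a merge sort function.",
    routing_tokens=("__CODING__", "__PYTHON__"),
)
print(prompt)
```

The resulting string can be passed to `tokenizer.encode` in the same way as the hand-written string in the Usage section.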

# Basic system and dialogue tokens
"<|unk|>",
"<|endoftext|>",
"<|padding|>",
"<|im_start|>",
"<|im_end|>",
# Core roles
"<|im_start|>system",
"<|im_start|>user",
"<|im_start|>assistant",
# Additional useful tokens
"<|im_start|>tool",
"<|im_start|>function",
# Reasoning block
"<|im_start|>thought",
# Tool calls
"<|tool_call|>",
"<|tool_response|>",
# Multimodality and audio-visual block
"<|image|>",
"<|video|>",
"<|sound|>",
"<|voice|>",
"<|listening|>",
"<|vision|>",
# Emotional state (Mood)
"<|mood_happy|>",
"<|mood_sad|>",
"<|mood_angry|>",
"<|mood_neutral|>",
# --- FIM (Fill-in-the-Middle) tokens from StarCoder ---
"<fim_prefix>",
"<fim_middle>",
"<fim_suffix>",
# Robotics (trajectory/command boundaries)
"<|action_start|>",
"<|action_end|>",
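
One way to read the list above is as an ordered sequence in which a token's position becomes its ID. The sketch below works under that assumption: the README only confirms that `<|unk|>` sits at ID 0, so the IDs of the other tokens here are illustrative, and `assign_special_ids` is a hypothetical helper rather than part of the tokenizer.

```python
# Subset of the special tokens listed in the README; the full list is longer.
SPECIAL_TOKENS = [
    "<|unk|>",
    "<|endoftext|>",
    "<|padding|>",
    "<|im_start|>",
    "<|im_end|>",
]


def assign_special_ids(tokens):
    """Map each special token to an ID equal to its position in the list.

    Assumption: IDs follow list order starting at 0. Only the ID of
    <|unk|> is stated in the README.
    """
    if len(set(tokens)) != len(tokens):
        raise ValueError("special tokens must be unique")
    return {token: index for index, token in enumerate(tokens)}


special_ids = assign_special_ids(SPECIAL_TOKENS)
assert special_ids["<|unk|>"] == 0  # matches the ID the README gives for <|unk|>
print(special_ids)
```

Rejecting duplicates up front matters because a repeated entry would silently shift every later ID under this scheme.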

## Highlights

- **Architecture**: Byte-level BPE (byte-level Byte Pair Encoding), which avoids out-of-vocabulary (OOV) tokens because any input can be represented as a sequence of bytes.
- **Form Factor**: Wrapped in Hugging Face `PreTrainedTokenizerFast` with a native `ByteLevel` decoder so that Cyrillic text decodes cleanly.
- **Chat Standard**: Out-of-the-box support for **ChatML** formatting (`<|im_start|>` / `<|im_end|>`).
- **Domain Specialization**: Predefined atomic routing tokens such as `__CODING__` and `__PYTHON__`.
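
To illustrate why a byte-level vocabulary has no OOV inputs and why a byte-level decoder is needed to get readable Cyrillic back, here is a minimal pure-Python sketch. It reproduces the GPT-2 style byte-to-character table that byte-level BPE tokenizers commonly use. That this tokenizer uses exactly this table is an assumption, since its configuration is not shown here, and the helper names `to_byte_level` and `from_byte_level` are hypothetical.

```python
def bytes_to_unicode():
    """Build a GPT-2 style table mapping each of the 256 byte values to a printable character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Remaining bytes (control characters, space, ...) move above U+00FF.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Encode text as UTF-8 and map every byte to its printable stand-in."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    """Reverse the mapping, which is the job of a byte-level decoder."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


sample = "Привет"
encoded = to_byte_level(sample)
print(encoded)  # unreadable stand-in symbols, two per Cyrillic letter
assert from_byte_level(encoded) == sample  # the decoder restores the text
```

Because every string reduces to bytes in the range 0 to 255 and all 256 values are in the table, no input falls outside the base alphabet. BPE merges then only shorten the sequence.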

## Usage

```python
from transformers import AutoTokenizer

# Replace the placeholder repo id with the actual Hugging Face repository.
tokenizer = AutoTokenizer.from_pretrained("your-username/jirack_router_tokenizer")

# ChatML user turn with routing tokens; the Russian prompt means
# "Write a merge sort function."
text = "<|im_start|>user\n__CODING__ __PYTHON__ Напиши функцию сортировки слиянием.<|im_end|>"
print(tokenizer.encode(text))
```