kgrabko committed · Commit cc2b114 · verified · 1 Parent(s): 73aff7a

Create README.md

Files changed (1)
  1. README.md +90 -0
README.md ADDED
@@ -0,0 +1,90 @@
+ ---
+ license: apache-2.0
+ language:
+ - multilingual
+ tags:
+ - tokenizer
+ - bpe
+ - byte-level-bpe
+ - chatml
+ - routing
+ - moe
+ ---
+
+ # JiRack Router Tokenizer Pro
+
+ A **high-performance custom Byte-Level BPE tokenizer** trained on the full Wikipedia dump across multiple languages.
+
+ Developed specifically for intelligent routing models and Mixture-of-Experts (MoE) systems.
+
+ ## Model Details
+
+ - **Developer**: Konstantin Grabko (JiRack)
+ - **License**: Apache License 2.0
+ - **Training Data**: Full Wikipedia multilingual dump
+ - **Vocabulary Size**: 65,536 tokens
+ - **Special Tokens**: 128 reserved tokens (including 40+ domain routing tokens)
+
+ ## Key Features
+
+ - `<|unk|>` token placed at ID 0
+ - Native support for the **ChatML** format (`<|im_start|>` / `<|im_end|>`)
+ - Large set of specialized routing tokens (`__CODING__`, `__MATH__`, `__PYTHON__`, `__SCIENCE__`, etc.)
+ - Support for JiRack Robotics Technology via action tags (`<|action_start|>` / `<|action_end|>`)
+ - Support for JiRack vision, image, and audio-visual inputs
+ - Strong multilingual performance
+
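The ChatML and routing-token features listed above can be combined when building a prompt. The following is a minimal sketch, not the author's implementation: the helper name `build_chatml_turn` is hypothetical, and placing the routing tokens at the start of the message body is an assumption that simply mirrors the README's own usage string.

```python
from typing import Sequence


def build_chatml_turn(role: str, content: str, routing_tags: Sequence[str] = ()) -> str:
    """Format one ChatML turn, optionally prefixed with routing tokens.

    Hypothetical helper: the README does not define a prompt-building API.
    """
    prefix = " ".join(routing_tags)
    body = f"{prefix} {content}" if prefix else content
    return f"<|im_start|>{role}\n{body}<|im_end|>"


# Routing tags are taken from the README; the message text is illustrative.
prompt = build_chatml_turn(
    "user",
    "Write a merge sort function.",
    routing_tags=("__CODING__", "__PYTHON__"),
)
print(prompt)
```

Keeping the formatting in one function makes it easier to keep role names and tag placement consistent across a dataset.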
+ ## Special Tokens
+
+ - **Basic system and dialogue tokens**: `<|unk|>`, `<|endoftext|>`, `<|padding|>`, `<|im_start|>`, `<|im_end|>`
+ - **Core roles**: `<|im_start|>system`, `<|im_start|>user`, `<|im_start|>assistant`
+ - **Additional useful tokens**: `<|im_start|>tool`, `<|im_start|>function`
+ - **Reasoning block**: `<|im_start|>thought`
+ - **Tool calls**: `<|tool_call|>`, `<|tool_response|>`
+ - **Multimodality and audio-visual block**: `<|image|>`, `<|video|>`, `<|sound|>`, `<|voice|>`, `<|listening|>`, `<|vision|>`
+ - **Emotional state (mood)**: `<|mood_happy|>`, `<|mood_sad|>`, `<|mood_angry|>`, `<|mood_neutral|>`
+ - **FIM (Fill-in-the-Middle) tokens from StarCoder**: `<fim_prefix>`, `<fim_middle>`, `<fim_suffix>`
+ - **Robotics (trajectory/command boundaries)**: `<|action_start|>`, `<|action_end|>`
+
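One way to confirm that the special tokens listed above are atomic vocabulary entries is to look each one up and check that it does not fall back to the unknown token. The sketch below assumes a loaded Hugging Face tokenizer object; `check_special_tokens` is a hypothetical helper name, and only `convert_tokens_to_ids`, `unk_token`, and `unk_token_id` are taken from the `transformers` tokenizer interface. No specific IDs are assumed beyond what the README states.

```python
from typing import Dict, Iterable


def check_special_tokens(tokenizer, tokens: Iterable[str]) -> Dict[str, int]:
    """Return a token -> ID mapping, raising if a token is not in the vocabulary.

    Hypothetical helper; `tokenizer` is expected to behave like a
    Hugging Face tokenizer (convert_tokens_to_ids, unk_token, unk_token_id).
    """
    ids: Dict[str, int] = {}
    for token in tokens:
        token_id = tokenizer.convert_tokens_to_ids(token)
        # A token missing from the vocabulary resolves to the unknown-token ID
        # (or None when no unknown token is configured).
        is_unk_fallback = token != tokenizer.unk_token and token_id == tokenizer.unk_token_id
        if token_id is None or is_unk_fallback:
            raise ValueError(f"{token!r} is not a single vocabulary entry")
        ids[token] = token_id
    return ids


# Example call, assuming `tokenizer` was loaded with AutoTokenizer.from_pretrained(...):
# check_special_tokens(tokenizer, ["<|unk|>", "<|im_start|>", "<|im_end|>", "<|action_start|>"])
```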
+ ## Highlights
+
+ - **Architecture**: Byte-Level BPE (byte-level Byte Pair Encoding), which avoids out-of-vocabulary (OOV) tokens because any input can be represented as a sequence of bytes.
+ - **Form Factor**: Fully wrapped in Hugging Face `PreTrainedTokenizerFast` with native `ByteLevel` decoders for clean Cyrillic output.
+ - **Chat Standard**: Out-of-the-box support for **ChatML** formatting (`<|im_start|>` / `<|im_end|>`).
+ - **Domain Specialization**: Built-in atomic routing tokens such as `__CODING__` and `__PYTHON__`.
+
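The OOV claim in the highlights follows from the base alphabet of byte-level BPE: every string decomposes into UTF-8 bytes, and there are only 256 possible byte values. The sketch below illustrates that principle with plain Python only. It does not reproduce this tokenizer's learned merges, and the function name `to_byte_symbols` is hypothetical. Note that a real `ByteLevel` pre-tokenizer additionally maps each byte to a printable character, which is why a matching `ByteLevel` decoder is needed to get readable Cyrillic text back.

```python
from typing import List


def to_byte_symbols(text: str) -> List[int]:
    """Decompose text into UTF-8 byte values, the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))


sample = "Напиши функцию"  # Cyrillic fragment borrowed from the README usage string
symbols = to_byte_symbols(sample)

# Every symbol lies in range(256), so the base vocabulary always covers the input.
assert all(0 <= b < 256 for b in symbols)

# Cyrillic letters occupy two bytes each in UTF-8, so there are more bytes than characters.
print(len(sample), len(symbols))

# The decomposition is lossless: decoding the bytes restores the original text.
assert bytes(symbols).decode("utf-8") == sample
```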
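+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Replace the placeholder repo ID with the actual Hub path of the tokenizer.
+ tokenizer = AutoTokenizer.from_pretrained("your-username/jirack_router_tokenizer")
+
+ # ChatML user turn with routing tokens; the Russian prompt means
+ # "Write a merge sort function."
+ text = "<|im_start|>user\n__CODING__ __PYTHON__ Напиши функцию сортировки слиянием.<|im_end|>"
+ print(tokenizer.encode(text))
+ ```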