kgrabko committed on
Commit 26b4ea5 · verified · 1 Parent(s): 2b89f87

Update README.md

Files changed (1)
  1. README.md +50 -1
README.md CHANGED
@@ -92,4 +92,53 @@ from transformers import AutoTokenizer
  tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")
 
  text = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
- print(tokenizer.encode(text))
+ print(tokenizer.encode(text))
+
+
+ ### Benchmark for token quality
+
+
+ ```bash
+ === Text after ChatML Template ===
+ <|im_start|>system
+ You are a precise router model.<|im_end|>
+ <|im_start|>user
+ __CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>
+
+
+ === Tokens (IDs) ===
+ [5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]
+
+ === Decoding Token by Token ===
+ [transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+ 5 -> '<|im_start|>system'
+ 326 -> '
+ '
+ 5095 -> 'You'
+ 944 -> ' are'
+ 396 -> ' a'
+ 23348 -> ' precise'
+ 1021 -> ' ro'
+ 7831 -> 'uter'
+ 5869 -> ' model'
+ 141 -> '.'
+ 4 -> '<|im_end|>'
+ 326 -> '
+ '
+ 6 -> '<|im_start|>user'
+ 326 -> '
+ '
+ 29 -> '__CODING__'
+ 348 -> ' '
+ 44 -> '__PYTHON__'
+ 26876 -> ' Write'
+ 396 -> ' a'
+ 52698 -> ' merge'
+ 6521 -> ' sort'
+ 2031 -> ' function'
+ 460 -> ' in'
+ 7524 -> ' Python'
+ 141 -> '.'
+ 4 -> '<|im_end|>'
+ 326 -> '
+ '
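
The log above can be checked mechanically. The following is a minimal sketch, not the script that produced the log: it rebuilds the ChatML prompt, replays the logged (id, piece) pairs, and tests two properties that the log suggests matter for a router tokenizer, namely that the pieces concatenate back to the exact prompt and that the routing tags are single tokens. All helper names (`build_chatml`, `decode_each`, `is_lossless`, `single_token_tags`, `benchmark_hub_tokenizer`) are hypothetical, and the two-turn template layout is an assumption inferred from the logged text.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Piece = Tuple[int, str]  # (token id, decoded text of that single token)


def build_chatml(system: str, user: str) -> str:
    """Assemble a two-turn ChatML prompt (layout assumed from the log)."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
    )


def decode_each(ids: Iterable[int], decode_one: Callable[[int], str]) -> List[Piece]:
    """Pair every token id with the text it decodes to on its own."""
    return [(token_id, decode_one(token_id)) for token_id in ids]


def is_lossless(text: str, pieces: List[Piece]) -> bool:
    """True if the decoded pieces concatenate back to the original text."""
    return "".join(piece for _, piece in pieces) == text


def single_token_tags(pieces: List[Piece], tags: Iterable[str]) -> Dict[str, bool]:
    """Report, per tag, whether it appears as exactly one decoded piece."""
    seen = {piece for _, piece in pieces}
    return {tag: tag in seen for tag in tags}


def benchmark_hub_tokenizer(model_id: str, text: str) -> List[Piece]:
    """Hypothetical helper: needs `transformers` installed and network access.

    Passing clean_up_tokenization_spaces=False follows the suggestion in the
    warning printed in the log; add_special_tokens=False is an assumption so
    that only the tokens of `text` itself are listed.
    """
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    ids = tokenizer.encode(text, add_special_tokens=False)
    return decode_each(
        ids,
        lambda i: tokenizer.decode([i], clean_up_tokenization_spaces=False),
    )


# (id, piece) pairs copied from the "Decoding Token by Token" log.
LOGGED_PIECES: List[Piece] = [
    (5, "<|im_start|>system"), (326, "\n"), (5095, "You"), (944, " are"),
    (396, " a"), (23348, " precise"), (1021, " ro"), (7831, "uter"),
    (5869, " model"), (141, "."), (4, "<|im_end|>"), (326, "\n"),
    (6, "<|im_start|>user"), (326, "\n"), (29, "__CODING__"), (348, " "),
    (44, "__PYTHON__"), (26876, " Write"), (396, " a"), (52698, " merge"),
    (6521, " sort"), (2031, " function"), (460, " in"), (7524, " Python"),
    (141, "."), (4, "<|im_end|>"), (326, "\n"),
]

PROMPT = build_chatml(
    "You are a precise router model.",
    "__CODING__ __PYTHON__ Write a merge sort function in Python.",
)

print(len(LOGGED_PIECES), is_lossless(PROMPT, LOGGED_PIECES))
print(single_token_tags(LOGGED_PIECES, ["__CODING__", "__PYTHON__"]))
```

Run against the logged data, this reports 27 tokens, a lossless round trip, and both routing tags as single tokens. To repeat the check against the published tokenizer, something like `benchmark_hub_tokenizer("CMSManhattan/JiRack-Router-Tokenizer-65K", PROMPT)` could replace `LOGGED_PIECES`; that call is left out of the sketch because it downloads files from the Hub.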