CMSManhattan/JiRack-Router-Tokenizer-65K

Mixture of Experts


kgrabko committed about 10 hours ago

Commit 26b4ea5 (verified) · 1 parent: 2b89f87

Update README.md


README.md +50 -1


@@ -92,4 +92,53 @@ from transformers import AutoTokenizer
 tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")
 text = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
-print(tokenizer.encode(text))
+print(tokenizer.encode(text))
+```
+### Tokenization quality benchmark
+```text
+=== Text after ChatML Template ===
+<|im_start|>system
+You are a precise router model.<|im_end|>
+<|im_start|>user
+__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>
+=== Tokens (IDs) ===
+[5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]
+=== Decoding Token by Token ===
+[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
+5 -> '<|im_start|>system'
+326 -> '
+'
+5095 -> 'You'
+944 -> ' are'
+396 -> ' a'
+23348 -> ' precise'
+1021 -> ' ro'
+7831 -> 'uter'
+5869 -> ' model'
+141 -> '.'
+4 -> '<|im_end|>'
+326 -> '
+'
+6 -> '<|im_start|>user'
+326 -> '
+'
+29 -> '__CODING__'
+348 -> ' '
+44 -> '__PYTHON__'
+26876 -> ' Write'
+396 -> ' a'
+52698 -> ' merge'
+6521 -> ' sort'
+2031 -> ' function'
+460 -> ' in'
+7524 -> ' Python'
+141 -> '.'
+4 -> '<|im_end|>'
+326 -> '
+'
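
The token-by-token listing above can be sanity-checked offline. The sketch below is not part of the commit: it transcribes the (ID, piece) pairs from the log and checks that they concatenate back to the templated text and that each ID always decodes to the same piece. The names `DECODED`, `EXPECTED_TEXT`, `roundtrip`, and `is_consistent` are hypothetical, and the trailing newline in `EXPECTED_TEXT` is an assumption inferred from the final token `326`.

```python
# (token_id, decoded_piece) pairs transcribed from the benchmark log above.
DECODED = [
    (5, "<|im_start|>system"), (326, "\n"),
    (5095, "You"), (944, " are"), (396, " a"), (23348, " precise"),
    (1021, " ro"), (7831, "uter"), (5869, " model"), (141, "."),
    (4, "<|im_end|>"), (326, "\n"),
    (6, "<|im_start|>user"), (326, "\n"),
    (29, "__CODING__"), (348, " "), (44, "__PYTHON__"),
    (26876, " Write"), (396, " a"), (52698, " merge"), (6521, " sort"),
    (2031, " function"), (460, " in"), (7524, " Python"), (141, "."),
    (4, "<|im_end|>"), (326, "\n"),
]

# Text shown under "=== Text after ChatML Template ===".
# Assumption: it ends with a newline, as the last decoded token suggests.
EXPECTED_TEXT = (
    "<|im_start|>system\n"
    "You are a precise router model.<|im_end|>\n"
    "<|im_start|>user\n"
    "__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>\n"
)


def roundtrip(decoded):
    """Concatenate the decoded pieces in order."""
    return "".join(piece for _, piece in decoded)


def is_consistent(decoded):
    """Return True if every token ID maps to exactly one piece."""
    seen = {}
    for token_id, piece in decoded:
        if seen.setdefault(token_id, piece) != piece:
            return False
    return True


ids = [token_id for token_id, _ in DECODED]
print("token count:", len(ids))
print("round-trip matches:", roundtrip(DECODED) == EXPECTED_TEXT)
print("IDs decode consistently:", is_consistent(DECODED))
```

On this sample the routing markers `__CODING__` and `__PYTHON__` each occupy a single ID (29 and 44), while the word "router" is split into two pieces (` ro` and `uter`).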