---
license: apache-2.0
language:
- multilingual
tags:
- tokenizer
- bpe
- byte-level-bpe
- chatml
- routing
- moe
---

Enjoy! We extend the JiRack Models Ecosystem! 🚀

- The Best RAG with Mixture-of-Experts via the CMS Manhattan RAG System with JiRack Ternary Experts for CPU Solution.
- JiRackTernary_1b model: https://huggingface.co/kgrabko/JiRackTernary_1b

# JiRack Router Tokenizer Pro

**High-performance custom Byte-Level BPE tokenizer** trained on the full Wikipedia dump across multiple languages. Developed specifically for intelligent routing models and Mixture-of-Experts systems.

## Model Details

- **Developer**: Konstantin Grabko (JiRack)
- **License**: Apache License 2.0
- **Training Data**: Full Wikipedia multilingual dump
- **Vocabulary Size**: 65,536 tokens
- **Special Tokens**: 128 reserved tokens (including 40+ domain routing tokens)

## Key Features

- Correctly placed `<|unk|>` token at ID 0
- Full native support for **ChatML** format (`<|im_start|>` / `<|im_end|>`)
- Large set of specialized routing tokens (`__CODING__`, `__MATH__`, `__PYTHON__`, `__SCIENCE__`, etc.)
- Support for JiRack Robotics Technology via action tags (`<|action_start|>` / `<|action_end|>`)
- Support for JiRack vision, images, and audio-visual content
- Strong multilingual performance

## Special Tokens

- **Basic system and dialogue tokens**: `<|unk|>`, `<|endoftext|>`, `<|padding|>`, `<|im_start|>`, `<|im_end|>`
- **Core roles**: `<|im_start|>system`, `<|im_start|>user`, `<|im_start|>assistant`
- **Additional useful tokens**: `<|im_start|>tool`, `<|im_start|>function`
- **Reasoning block**: `<|im_start|>thought`
- **Tool calls**: `<|tool_call|>`, `<|tool_response|>`
- **Multimodality and audio-visual block**: `<|image|>`, `<|video|>`, `<|sound|>`, `<|voice|>`, `<|listening|>`, `<|vision|>`
- **Emotional state (mood)**: `<|mood_happy|>`, `<|mood_sad|>`, `<|mood_angry|>`, `<|mood_neutral|>`
- **FIM (Fill-in-the-Middle) tokens from StarCoder**: `<|fim_prefix|>`, `<|fim_middle|>`, `<|fim_suffix|>`
- **Robotics (trajectory/command boundaries)**: `<|action_start|>`, `<|action_end|>`

## Highlights

- **Architecture**: Byte-Level BPE (byte-level Byte Pair Encoding), which natively prevents out-of-vocabulary (OOV) tokens.
- **Form Factor**: Fully wrapped in Hugging Face `PreTrainedTokenizerFast` with native `ByteLevel` decoders for clean Cyrillic representation.
- **Chat Standard**: Out-of-the-box support for **ChatML** formatting (`<|im_start|>` / `<|im_end|>`).
- **Domain Specialization**: Pre-baked atomic routing tokens such as `__CODING__` and `__PYTHON__`.

## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("CMSManhattan/JiRack-Router-Tokenizer-65K")

# A ChatML user turn prefixed with two routing tokens
text = "<|im_start|>user\n__CODING__ __PYTHON__ Write a merge sort function.<|im_end|>"
print(tokenizer.encode(text))
```

### Token quality benchmark
```bash
=== Text after ChatML Template ===
<|im_start|>system
You are a precise router model.<|im_end|>
<|im_start|>user
__CODING__ __PYTHON__ Write a merge sort function in Python.<|im_end|>

=== Tokens (IDs) ===
[5, 326, 5095, 944, 396, 23348, 1021, 7831, 5869, 141, 4, 326, 6, 326, 29, 348, 44, 26876, 396, 52698, 6521, 2031, 460, 7524, 141, 4, 326]

=== Decoding Token by Token ===
[transformers] Ignoring clean_up_tokenization_spaces=True for BPE tokenizer TokenizersBackend. The clean_up_tokenization post-processing step is designed for WordPiece tokenizers and is destructive for BPE (it strips spaces before punctuation). Set clean_up_tokenization_spaces=False to suppress this warning, or set clean_up_tokenization_spaces_for_bpe_even_though_it_will_corrupt_output=True to force cleanup anyway.
5 -> '<|im_start|>system'
326 -> '
'
5095 -> 'You'
944 -> ' are'
396 -> ' a'
23348 -> ' precise'
1021 -> ' ro'
7831 -> 'uter'
5869 -> ' model'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
6 -> '<|im_start|>user'
326 -> '
'
29 -> '__CODING__'
348 -> ' '
44 -> '__PYTHON__'
26876 -> ' Write'
396 -> ' a'
52698 -> ' merge'
6521 -> ' sort'
2031 -> ' function'
460 -> ' in'
7524 -> ' Python'
141 -> '.'
4 -> '<|im_end|>'
326 -> '
'
```
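
The ChatML text used in the benchmark above can be assembled with a small helper. The sketch below is illustrative only: `build_chatml` and its `routing_tags` parameter are hypothetical names, not part of this repository. It assumes the usual ChatML layout (a newline after the role and after each `<|im_end|>`) and that routing tokens are placed at the start of the last user turn, separated by single spaces, as in the benchmark input.

```python
from typing import Iterable, Mapping, Sequence

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml(
    messages: Sequence[Mapping[str, str]],
    routing_tags: Iterable[str] = (),
) -> str:
    """Render messages as ChatML text (hypothetical helper).

    Routing tags, if any, are prepended to the last user message.
    """
    tags = " ".join(routing_tags)
    # Index of the last user turn, or -1 when there is none.
    last_user = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    parts = []
    for i, message in enumerate(messages):
        content = message["content"]
        if tags and i == last_user:
            content = f"{tags} {content}"
        # Assumed layout: role, newline, content, end marker, newline.
        parts.append(f"{IM_START}{message['role']}\n{content}{IM_END}\n")
    return "".join(parts)


prompt = build_chatml(
    [
        {"role": "system", "content": "You are a precise router model."},
        {"role": "user", "content": "Write a merge sort function in Python."},
    ],
    routing_tags=["__CODING__", "__PYTHON__"],
)
print(prompt)
```

The resulting string can be passed to `tokenizer.encode(prompt)` with the tokenizer loaded as in the Usage section. Building the prompt as plain text keeps the routing tokens visible and easy to log before encoding.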