# Traum Tokenizer
Traum Tokenizer is a high-performance, specialized tokenizer designed for next-generation Large Language Models (LLMs) and optimized specifically for the Flash-SLM project. Developed after extensive research into existing tokenizers such as those of GPT-2 and BERT, Traum Tokenizer addresses the need to balance compression efficiency, training speed, and linguistic understanding.
## Overview
A tokenizer's efficiency is paramount to a model's performance. Traum Tokenizer uses a byte-level BPE (Byte-Pair Encoding) algorithm, which guarantees that no unknown-token (`<unk>`) or encoding-error outputs are ever produced, making it robust across diverse text types.
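A minimal sketch of why byte-level encoding never needs an unknown token (illustrative only, not Traum's actual implementation): every input string is first decomposed into UTF-8 bytes, and each byte maps to one of 256 base tokens that the vocabulary always contains, so any character, including emoji and rare scripts, is representable.

```python
def byte_level_base_tokens(text: str) -> list[int]:
    """Map a string to its UTF-8 byte values: the 256 base tokens
    that a byte-level BPE vocabulary always contains."""
    return list(text.encode("utf-8"))

# Any input decomposes into known base tokens in the range 0-255,
# so there is never an <unk> fallback.
tokens = byte_level_base_tokens("café 🙂")
assert all(0 <= t < 256 for t in tokens)
```

In a full BPE tokenizer, learned merge rules then combine these base tokens into larger subword units; the byte-level base merely guarantees total coverage.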
## Key Features
- Massive Training Scale: Trained on a diverse dataset of 20 billion tokens.
- Expanded Vocabulary: Features a vocabulary size larger than GPT-2 by over 15,000 tokens, allowing for better representation of complex and modern terminology.
- Precision Engineering: Optimized for reasoning, mathematical symbols, and structural code.
- Optimized for Efficiency: Designed to maximize training throughput and inference quality for Small Language Models (SLMs).
## Performance Benchmarks
Traum Tokenizer has been benchmarked against GPT-2 and LLaMA tokenizers across multiple domains. The performance metrics focus on the compression ratio (Characters per Token), where higher values indicate more efficient tokenization.
| Benchmark Category | Traum Tokenizer | GPT-2 Tokenizer | LLaMA Tokenizer |
|---|---|---|---|
| English Text | 2.80 | 2.80 | 2.33 |
| Mathematical Logic | 1.00 | 1.00 | 0.83 |
| Code Syntax | 2.57 | 2.57 | 2.57 |
| Chain-of-Thought (CoT) | 7.00 | 3.50 | 3.11 |
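The compression ratio reported above is straightforward to reproduce for any tokenizer; a small helper (the function name is ours, for illustration):

```python
def chars_per_token(text: str, num_tokens: int) -> float:
    """Compression ratio: characters per token.
    Higher values mean fewer tokens are needed for the same text."""
    return len(text) / num_tokens

# A 44-character sentence encoded into 10 tokens yields 4.4 chars/token.
ratio = chars_per_token("The quick brown fox jumps over the lazy dog.", 10)
```

In practice, `num_tokens` would come from `len(tokenizer.encode(text))` for the tokenizer under test.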
### Benchmark Analysis
- English: Traum outperforms the LLaMA tokenizer and establishes a performance profile comparable to the industry-standard GPT-2.
- Mathematics: Traum matches GPT-2 and outperforms LLaMA, capturing mathematical structures with high precision.
- Code: Performance is consistent and equal with current state-of-the-art tokenizers.
- Reasoning (CoT): The current version exhibits extremely high compression in reasoning tasks (7.00 chars/token). While this is highly efficient, future iterations (Traum v2) will focus on tuning this compression to better preserve linguistic nuance in dense reasoning chains.
## Visual Comparison
The chart below (`traum_chart.png`) visualizes the comparative efficiency of Traum Tokenizer across different test sets.

![Benchmark comparison across test sets](traum_chart.png)
## Future Development
Traum Tokenizer is the foundational component for a series of upcoming open-source AI models designed for high-efficiency reasoning. These models will be released on the official account. Based on community interest and feedback, the tokenizer architecture may be fully open-sourced for broad use in the future.
## Usage
Load the tokenizer via the Hugging Face Transformers library:
```python
from transformers import AutoTokenizer

# Download the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("assemsabry/traum-tokenizer")

# Example usage: encode a sentence, then decode it back
text = "The quick brown fox jumps over the lazy dog."
tokens = tokenizer.encode(text)
print(f"Encoded tokens: {tokens}")
print(f"Decoded text: {tokenizer.decode(tokens)}")
```
## Repository Structure
- `tokenizer.json`: Core BPE tokenizer configuration and vocabulary.
- `tokenizer_config.json`: Metadata and configuration for the Transformers/Tokenizers libraries.
- `traum_chart.png`: Benchmark visualization.
- `README.md`: System documentation and benchmarks.
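The vocabulary stored in `tokenizer.json` can be inspected directly with the standard `json` module; this sketch assumes the standard Hugging Face Tokenizers serialization layout, where the BPE vocabulary lives under `model.vocab`:

```python
import json

def vocab_size(path: str) -> int:
    """Read a Hugging Face tokenizer.json file and return the size
    of its BPE vocabulary (the "model" -> "vocab" mapping)."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return len(data["model"]["vocab"])
```

This is handy for verifying claims like the vocabulary-size comparison above without loading the full Transformers stack.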
## Developer
Assem Sabry is an Egyptian AI Engineer & Researcher and the founder of Token AI (founded in 2025).
- Website: https://assem.cloud/
- LinkedIn: https://www.linkedin.com/in/assem7/
