almaghrabima/SARFTokenizer

sentencepiece-style

6.64 MB

1 contributor

History: 56 commits

Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)

cf3eebc verified about 17 hours ago

.gitattributes

1.64 kB
Upload lexicons/lexicon_en.txt with huggingface_hub 3 months ago
BENCHMARK.md

1.74 kB
Promote v0.3.1 to main (4-domain SOTA at 100k vocab) about 17 hours ago
FAIR_BENCHMARK.md

5.67 kB
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% 3 days ago
README.md

11.8 kB
Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) about 17 hours ago
bench_results.json

12.5 kB
Promote v0.3.1 to main (4-domain SOTA at 100k vocab) about 17 hours ago
benchmark_results.json

2.97 kB
v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab 4 days ago
benchmark_results_2026flagships.json

883 Bytes
README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships 4 days ago
fair_benchmark_results.json

4.8 kB
Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2% 3 days ago
special_tokens_map.json

449 Bytes
Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) about 17 hours ago
tokenizer.json

6.6 MB
Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) about 17 hours ago
tokenizer_config.json

571 Bytes
Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged) about 17 hours ago