almaghrabima/SARFTokenizer
Tags: Arabic, English, tokenizers, tokenizer, unigram, sarf, bilingual, arabic, english, math, code, sentencepiece-style
arxiv: 2512.18399
arxiv: 2508.08424
License: apache-2.0
Branch: main
Repository size: 6.64 MB
1 contributor
History: 56 commits
Latest commit: cf3eebc (verified) by almaghrabima, about 17 hours ago: Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)
Files (name, size, last modified: last commit message)

.gitattributes, 1.64 kB, 3 months ago: Upload lexicons/lexicon_en.txt with huggingface_hub
BENCHMARK.md, 1.74 kB, about 17 hours ago: Promote v0.3.1 to main (4-domain SOTA at 100k vocab)
FAIR_BENCHMARK.md, 5.67 kB, 3 days ago: Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2%
README.md, 11.8 kB, about 17 hours ago: Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)
bench_results.json, 12.5 kB, about 17 hours ago: Promote v0.3.1 to main (4-domain SOTA at 100k vocab)
benchmark_results.json, 2.97 kB, 4 days ago: v0.2: Unigram LM at 65k vocab — beats GPT-4o on AR and EN at 1/3 vocab
benchmark_results_2026flagships.json, 883 Bytes, 4 days ago: README: add Gemma-4, Qwen3.6, Kimi-K2.6 benchmarks + head-to-head vs flagships
fair_benchmark_results.json, 4.8 kB, 3 days ago: Honest OOD FineWeb benchmark: lead AR +13.5% vs GPT-4o, lose EN by 6.2%
special_tokens_map.json, 449 Bytes, about 17 hours ago: Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)
tokenizer.json, 6.6 MB, about 17 hours ago: Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)
tokenizer_config.json, 571 Bytes, about 17 hours ago: Retrain v0.3.1 with 13 modern <|...|> special tokens (chat / code-block / tool-output boundaries; benchmark unchanged)