SARFTokenizer v0.3.1: 4-domain (AR / EN / Math / Code) at 100k vocab

A 4-domain tokenizer with a 100,000-token vocabulary, built on HuggingFace's Unigram LM with the AraToken-style normalization pipeline. It adds math and code to the bilingual AR/EN coverage of v0.2 without regressing Arabic, and pushes Arabic CpT (characters per token; higher is better) to 4.004, the highest we have measured on any tokenizer at any vocab size.

Modern special-token scheme. v0.3.1 ships 13 atomic special tokens with the <|name|> convention, including chat boundaries, code-block boundaries, tool-output boundaries, and a mask token. See Special tokens below.

The headline: what we actually claim

SOTA on every domain, regardless of vocab tier. Across the 15 tokenizers benchmarked, v0.3.1 is simultaneously the best Arabic, best English, best math, and best code tokenizer we have measured, beating GPT-4o's o200k_base (200k vocab) on every domain at half the vocab size.

Benchmark: 1,200-document held-out 4-domain eval

300 docs each of Arabic, English, math (FineMath-4plus), and code (Nemotron-Code), with a 2,000-char cap per doc, encoded with add_special_tokens=False. No external preprocessing: each tokenizer's own normalizer/pre-tokenizer runs as shipped. All domain scores are CpT (characters per token); higher is better.
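
The per-domain score is plain characters-per-token. A minimal sketch of how each cell is computed (the helper name and toy input are illustrative, not the benchmark script itself):

from transformers import AutoTokenizer

def chars_per_token(tok, docs, cap=2000):
    # CpT = total characters / total tokens across a domain's documents
    chars = tokens = 0
    for doc in docs:
        doc = doc[:cap]  # 2,000-char cap per doc
        chars += len(doc)
        tokens += len(tok(doc, add_special_tokens=False).input_ids)
    return chars / tokens

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(chars_per_token(tok, ["المعلم يشرح الدرس في الصف اليوم."]))  # toy single-doc "domain"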

| Rank | Tokenizer | Vocab | AR | EN | Math | Code | AR/EN parity |
|---|---|---|---|---|---|---|---|
| 1 | SARFTokenizer v0.3.1 | 100,000 | 4.004 | 3.733 | 4.243 | 4.200 | 1.073 |
| 2 | SARFTokenizer v0.2 | 65,000 | 3.683 | 3.522 | 3.922 | 3.913 | 1.046 |
| 3 | SARFTokenizer v0.3 | 80,000 | 3.192 | 3.631 | 4.259 | 4.224 | 0.879 |
| 4 | Qwen3.6-35B-A3B | 248,077 | 3.129 | 2.985 | 3.233 | 3.432 | 1.048 |
| 5 | tiktoken/o200k_base (GPT-4o, GPT-5) | 200,019 | 3.087 | 3.409 | 3.505 | 3.622 | 0.906 |
| 6 | ALLaM-7B-Instruct-preview | 64,000 | 2.854 | 2.518 | 3.000 | 3.250 | 1.133 |
| 7 | google/gemma-4-31B-it | 262,144 | 2.833 | 3.069 | 3.242 | 3.383 | 0.923 |
| 7 (tie) | google/gemma-3-1b-pt | 262,145 | 2.833 | 3.069 | 3.242 | 3.384 | 0.923 |
| 9 | google/gemma-2-2b | 256,000 | 2.779 | 3.117 | 3.269 | 3.383 | 0.892 |
| 10 | QCRI/Fanar-1-9B-Instruct | 128,256 | 2.778 | 3.047 | 3.221 | 3.346 | 0.911 |
| 11 | Qwen2.5-0.5B | 151,665 | 2.583 | 2.923 | 3.299 | 3.512 | 0.884 |
| 12 | Hala-350M | 64,400 | 2.219 | 3.220 | 3.367 | 3.477 | 0.689 |
| 13 | Kimi-K2.6 | 163,840 | 2.074 | 3.239 | 3.520 | 3.630 | 0.640 |
| 14 | tiktoken/cl100k_base (GPT-4) | 100,277 | 1.429 | 3.066 | 3.479 | 3.607 | 0.466 |
| 15 | Falcon-7B | 65,024 | 0.991 | 2.720 | 3.108 | 3.210 | 0.364 |

v0.3.1 vs the best peer per domain

| Domain | v0.3.1 | Best peer | Δ |
|---|---|---|---|
| Arabic | 4.004 | Qwen3.6-35B-A3B (3.129) | +27.9% |
| English | 3.733 | GPT-4o o200k_base (3.409) | +9.5% |
| Math | 4.243 | GPT-4o o200k_base (3.505) | +21.0% |
| Code | 4.200 | GPT-4o o200k_base (3.622) | +16.0% |

v0.3.1 vs prior SARFTokenizer revisions

| Domain | v0.2 (65k) | v0.3 (80k) | v0.3.1 (100k) | Δ vs v0.2 |
|---|---|---|---|---|
| Arabic | 3.683 | 3.192 | 4.004 | +8.7% |
| English | 3.522 | 3.631 | 3.733 | +6.0% |
| Math | 3.922 | 4.259 | 4.243 | +8.2% |
| Code | 3.913 | 4.224 | 4.200 | +7.3% |

The 100k vocab gives Arabic roughly 50,000 effective slots (100k × the 50% Arabic share of the training corpus, vs. ~32,500 at v0.2's 65k), while the 250M-char Arabic training share matches v0.2 exactly. Arabic therefore strictly gains from the larger vocab, and math/code retain v0.3-class compression.

Why this matters

  • Arabic-first deployments: 4.004 AR CpT means ~30% more Arabic context in the same window vs GPT-4o, ~9% more vs our own v0.2.
  • Bilingual + technical domains: math and code now first-class β€” strong compression on Python, math word problems, and formal reasoning chains.
  • Vocab specialization > vocab size: at 100k we beat models with 200k–262k vocabularies on every domain.
  • Same infrastructure: AutoTokenizer.from_pretrained without trust_remote_code, no Python preprocessing.
  • Modern special tokens: chat / code-block / tool-output boundaries baked into atomic IDs from training time β€” no post-hoc resize needed.

Caveats we want you to know

  1. Lossy Arabic normalization (inherited from v0.2). Tashkeel, Alef variants, Alef Maksura, and Arabic-Indic digits are normalized at encode time. Not suitable for Qur'anic text or classical poetry with full diacritics.
  2. Math is web-style. Trained on FineMath-4plus: natural-language math web text, not LaTeX-heavy formal mathematics.
  3. Code is Python-leaning. Trained on Nemotron-Code, which is dominated by Python competitive-programming solutions with <think> reasoning. Less common languages may fall back to byte-level pieces more often.
  4. Larger embedding table. A 100k × hidden_dim table is ~54% bigger than v0.2's 65k-row table. Worth it if you can afford the parameters; if not, see v0.2 (AR/EN only) or v0.3 (4-domain at 80k, with an AR regression). A rough parameter count follows this list.
  5. Breaking change vs v0.2/v0.3 special tokens. The old <s> / </s> / <unk> / <pad> are no longer present. Pin revision="v0.2" if you depend on the old token IDs.
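
For concreteness on caveat 4, the embedding-table arithmetic (the hidden size is an arbitrary example, not a shipped model):

hidden = 2048                 # example model width
print(100_000 * hidden)       # 204,800,000 embedding params at 100k vocab
print(65_000 * hidden)        # 133,120,000 at v0.2's 65k vocab
print(100_000 / 65_000 - 1)   # ~0.538 -> ~54% more rows, independent of width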

Special tokens

13 atomic special tokens with reserved IDs 0–12 (single-token, never split):

ID  Token                Slot        Purpose
 0  <|assistant_end|>    additional  end of assistant turn (chat)
 1  <|assistant_start|>  additional  start of assistant turn (chat)
 2  <|bos|>              bos_token   beginning-of-sequence
 3  <|end_of_text|>      eos_token   end-of-sequence
 4  <|mask|>             mask_token  mask for FIM / denoising / infilling
 5  <|output_end|>       additional  end of tool / exec output block
 6  <|output_start|>     additional  start of tool / exec output block
 7  <|pad|>              pad_token   padding
 8  <|python_end|>       additional  end of Python code block
 9  <|python_start|>     additional  start of Python code block
10  <|unk|>              unk_token   unknown / byte-fallback signal
11  <|user_end|>         additional  end of user turn (chat)
12  <|user_start|>       additional  start of user turn (chat)

The named slots resolve as expected:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(tok.bos_token, tok.eos_token, tok.mask_token)
# → <|bos|>  <|end_of_text|>  <|mask|>

The chat / code / output tokens enable a downstream model to emit:

<|user_start|>solve x^2 + 3x = 10<|user_end|>
<|assistant_start|>
<|python_start|>
from sympy import symbols, solve
x = symbols('x')
print(solve(x**2 + 3*x - 10))
<|python_end|>
<|output_start|>
[-5, 2]
<|output_end|>
The roots are x = -5 and x = 2.
<|assistant_end|>
<|end_of_text|>

without any markup-tokenization overhead: every boundary is a single token.
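
You can check the atomicity directly (a quick sanity check, not a shipped script):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
for t in ["<|user_start|>", "<|python_start|>", "<|output_end|>", "<|mask|>"]:
    ids = tok.encode(t, add_special_tokens=False)
    print(t, ids)  # each boundary should map to exactly one reserved ID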


Overview

| Property | Value |
|---|---|
| Model | Unigram LM (HuggingFace tokenizers.models.Unigram) |
| Vocabulary size | 100,000 |
| Pre-tokenizer | Metaspace (▁ marker, SentencePiece-style) |
| Normalizer | AraToken-style: NFKC → Alef/Ya unification → tashkeel/tatweel/zero-width strip → Indic digits → ASCII |
| Special tokens | 13 (see table above) |
| Domains | Arabic + English + Math + Code |
| Training corpus | 500M chars (250M AR / 100M EN / 75M math / 75M code) |
| Training corpus repo | almaghrabima/deeplatent-labeled |
| Public API | AutoTokenizer.from_pretrained without trust_remote_code |

Quick start

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
print(tok.vocab_size)  # 100000

To pin to a specific revision:

# v0.3.1 (latest: 100k, 4-domain, modern specials; this revision)
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer", revision="v0.3.1")

# v0.3 (80k, 4-domain, legacy <s>/</s> specials; accepts AR regression for a smaller vocab)
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer", revision="v0.3")

# v0.2 (65k, AR/EN only, legacy specials; original SOTA-Arabic release)
tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer", revision="v0.2")

Low-level tokenizers API

from tokenizers import Tokenizer
tok = Tokenizer.from_pretrained("almaghrabima/SARFTokenizer")  # main = v0.3.1

print(tok.encode("Ψ§Ω„Ω…ΨΉΩ„Ω… يشرح Ψ§Ω„Ψ―Ψ±Ψ³ في الءف Ψ§Ω„ΩŠΩˆΩ….", add_special_tokens=False).tokens)
print(tok.encode("def fib(n):\n    return n if n<2 else fib(n-1)+fib(n-2)",
                 add_special_tokens=False).tokens)

Reproduce the benchmark

The eval set (300 AR + 300 EN + 300 math + 300 code) is built from:

  • AR/EN: the SARFTokenizer-benchmark-eval dataset.
  • Math: held-out tail of HuggingFaceTB/finemath (finemath-4plus).
  • Code: held-out tail of saurabh5/nemotron-post-training-dataset-v1-code with role markers stripped (problem + solution flattened with \n\n).

Each doc capped at 2000 chars, no normalization beyond what each tokenizer applies internally.
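
As a sketch, the math slice can be streamed straight from the Hub and scored in a few lines (we stream the head here for brevity; the actual eval uses the held-out tail):

from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")

# finemath-4plus exposes a "text" field
stream = load_dataset("HuggingFaceTB/finemath", "finemath-4plus",
                      split="train", streaming=True)
docs = [row["text"][:2000] for row in stream.take(300)]

chars = sum(len(d) for d in docs)
tokens = sum(len(tok(d, add_special_tokens=False).input_ids) for d in docs)
print(f"math CpT: {chars / tokens:.3f}")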

Normalization (lossy on Arabic, by design)

All Arabic text is normalized at encode time:

  • NFKC compat normalization
  • Tashkeel (U+064B–U+0652, U+0670) removed
  • Tatweel U+0640 removed
  • Zero-width + BiDi controls removed
  • Alef variants (Ψ£, Ψ₯, Ψ’, Ω±) β†’ bare Alef Ψ§
  • Alef Maksura Ω‰ β†’ Ya ي
  • Arabic-Indic digits (٠–٩) β†’ ASCII 0–9

Encoding is lossy on diacritics and Alef-Hamza variants, by design. If your downstream task requires preserving these (classical poetry with full diacritics, Qur'anic text), this tokenizer is not suitable.
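
A quick round trip makes the lossiness concrete (the expected output is derived from the rules above, not captured from a run):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("almaghrabima/SARFTokenizer")
ids = tok.encode("أَهْلاً بِالصَّفِّ ٥", add_special_tokens=False)
print(tok.decode(ids))
# expected: "اهلا بالصف 5" (tashkeel stripped, Alef unified, Indic digit -> ASCII)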

Why Unigram?

Recent literature (AraToken, arXiv:2512.18399; "Rethinking Tokenization for Rich Morphology", arXiv:2508.08424) finds that Unigram LM outperforms BPE on morphologically rich languages because its EM-based pruning recovers morphology implicitly, with no explicit morpheme preprocessing required. v0.3.1 suggests the finding extends to math and code: at 100k vocab the model discovers strong multi-character pieces for Arabic words, English tokens, math notation, and Python identifiers in a single homogeneous lattice.


Files

  • tokenizer.json β€” HuggingFace-format tokenizer (6.6 MB)
  • tokenizer_config.json β€” PreTrainedTokenizerFast config
  • special_tokens_map.json β€” special tokens map (5 named slots + 13-item additional)
  • BENCHMARK.md β€” full results across 15 tokenizers (this README's table)
  • bench_results.json β€” raw per-tokenizer per-domain metrics

Version history

  • v0.3.1 (latest, this revision) β€” 100k vocab, 4-domain, 13 modern <|...|> specials. SOTA on AR/EN/math/code.
  • v0.3 β€” 80k vocab, 4-domain, legacy <s> / </s> / <unk> / <pad> specials. Math/code SOTA but AR regresses vs v0.2.
  • v0.2 β€” 65k vocab, AR/EN only, legacy specials. Original release; SOTA Arabic at sub-100k tier.

License

Apache 2.0.
