---
tags:
- arabic
- tokenizer
- morphology
- nlp
- dialect
license: apache-2.0
language:
- ar
datasets:
- dataflare/arabic-dialect-corpus
- dataflare/egypt-legal-corpus
---

# DF-Arc

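**DF-Arc** is a specialized Arabic tokenizer that reduces the "Arabic Token Tax" (the extra tokens that general-purpose tokenizers spend on Arabic text) by combining **Morphological Pre-tokenization** with **PMI-based Phrase Merging**.

It reaches a fertility of 1.16 tokens per word on the benchmark reported below, close to a 1:1 token-to-word ratio, which translates to high semantic density: each token covers close to a full word.
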
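To make the phrase-merging idea concrete, here is a minimal sketch of scoring adjacent word pairs by pointwise mutual information (PMI) and joining the pairs that score highly. It is a generic illustration, not the DF-Arc training pipeline: the function names `pmi_scores` and `merge_phrases`, the `min_count` and `threshold` parameters, the `_` joiner, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=2):
    """Score adjacent word pairs by PMI: log2(p(a, b) / (p(a) * p(b)))."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def merge_phrases(words, scores, threshold, joiner="_"):
    """Greedily join adjacent words whose PMI is at or above the threshold."""
    merged, i = [], 0
    while i < len(words):
        pair = (words[i], words[i + 1]) if i + 1 < len(words) else None
        if pair is not None and scores.get(pair, float("-inf")) >= threshold:
            merged.append(words[i] + joiner + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


# Toy corpus (hypothetical), already split into words.
corpus = [
    ["الذكاء", "الاصطناعي", "مفيد"],
    ["الذكاء", "الاصطناعي", "جديد"],
    ["انا", "بحب", "الذكاء", "الاصطناعي"],
    ["انا", "عايز", "القهوة"],
]
scores = pmi_scores(corpus)
sentence = ["انا", "بحب", "الذكاء", "الاصطناعي", "جدا"]
print(merge_phrases(sentence, scores, threshold=1.0))
```

In this toy corpus only the pair "الذكاء الاصطناعي" occurs often enough to be scored, so it is the only pair merged into a single phrase unit. A real pipeline would tune the count cutoff and threshold on a large corpus.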
## Key Highlights

- **Architecture**: Unigram SentencePiece (compatible with `LlamaTokenizer`).
- **Vocab Size**: 128,000 tokens.
- **Baked-in Logic**: Rules for morphology (prefixes) and identity terms (names of God and the Prophet) are built into the vocabulary, so no custom code is needed to use the tokenizer.
- **Dialect Native**: Trained on Egyptian dialogue, songs, and feedback corpora.

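The prefix and identity rules mentioned above can be pictured with a small sketch. This is only an illustration of the general idea of morphological pre-tokenization, not the rules DF-Arc actually uses: the prefix list `PREFIXES`, the protected set `PROTECTED`, the `MIN_STEM_LEN` cutoff, and the function names are all hypothetical, and a naive prefix match like this will mis-split many real words that a proper morphological analyzer would handle.

```python
# Hypothetical, far-from-complete list of attachable prefixes.
PREFIXES = ("ال", "ب", "و", "ل")
# Identity terms that should never be split (assumed example).
PROTECTED = {"الله"}
# Do not split if the remaining stem would be shorter than this.
MIN_STEM_LEN = 2


def split_prefix(word, marker="_"):
    """Separate one leading prefix from its stem with a marker, if any applies."""
    if word in PROTECTED:
        return word
    for prefix in PREFIXES:
        stem = word[len(prefix):]
        if word.startswith(prefix) and len(stem) >= MIN_STEM_LEN:
            return prefix + marker + stem
    return word


def pre_tokenize(text):
    """Whitespace-split the text and mark the prefix boundary in each word."""
    return [split_prefix(word) for word in text.split()]


print(pre_tokenize("بسم الله الرحمن الرحيم"))
print(pre_tokenize("انا بحب جدا"))
```

Because the marked boundaries are learned as part of the vocabulary, a tokenizer trained on text prepared this way needs no such code at load time, which is consistent with the "no custom code" point above.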
## Performance

Fertility is the total number of tokens divided by the total number of words; lower is better, and 1.0 means one token per word.

| Model | Fertility (tokens/word) | Total Tokens | Total Words |
|-------|-------------------------|--------------|-------------|
| DF-Arc | 1.162 | 133,485 | 114,882 |
| GPT-4 (`cl100k_base`) | 3.689 | 423,743 | 114,882 |
| AraBERT v2 | 1.555 | 178,609 | 114,882 |
| AraT5 | 1.193 | 137,107 | 114,882 |

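The fertility column can be reproduced directly from the token and word counts in the table. The short sketch below does that arithmetic; the helper name `fertility` and the `results` dictionary are introduced here for illustration only.

```python
def fertility(total_tokens, total_words):
    """Average number of tokens produced per word."""
    return total_tokens / total_words


# (total tokens, total words) as listed in the table above.
results = {
    "DF-Arc": (133_485, 114_882),
    "GPT-4 (cl100k_base)": (423_743, 114_882),
    "AraBERT v2": (178_609, 114_882),
    "AraT5": (137_107, 114_882),
}

for name, (tokens, words) in results.items():
    print(f"{name}: {fertility(tokens, words):.3f}")

# How many times more tokens GPT-4's tokenizer needs than DF-Arc on the same text.
ratio = results["GPT-4 (cl100k_base)"][0] / results["DF-Arc"][0]
print(f"GPT-4 / DF-Arc token ratio: {ratio:.2f}")
```

Since every row uses the same word count, comparing total tokens between two tokenizers gives the same ratio as comparing their fertilities.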
## Usage

```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

# Split the text into subword tokens.
print(tokenizer.tokenize(text))
# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
```

## Citation

```bibtex
@misc{df_arc,
  title={DF-Arc: The Arabic Token Tax \& Morphology-Aware Tokenization},
  author={Dataflare Lab},
  year={2026},
  publisher={Hugging Face}
}
```