
	---
	tags:
	- arabic
	- tokenizer
	- morphology
	- nlp
	- dialect
	license: apache-2.0
	language:
	- ar
	datasets:
	- dataflare/arabic-dialect-corpus
	- dataflare/egypt-legal-corpus
	---

	# DF-Arc

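	DF-Arc is a specialized Arabic tokenizer that minimizes the "Arabic Token Tax" (the extra tokens that general-purpose tokenizers spend per Arabic word) by combining morphological pre-tokenization with phrase merging based on pointwise mutual information (PMI).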

	It achieves near 1:1 fertility (about 1.16 tokens per word) and, as a result, high semantic density per token.
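To make the PMI idea concrete, here is a minimal sketch of scoring adjacent word pairs by pointwise mutual information and keeping the high-scoring pairs as merge candidates. It is an illustration only, not the DF-Arc training code: the function names, the toy corpus, and the threshold value are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(sentences):
    """Score each adjacent word pair with PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {
        (x, y): math.log2((count / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), count in bigrams.items()
    }


def merge_candidates(scores, threshold):
    """Return pairs whose PMI is at or above the (hypothetical) threshold."""
    return sorted(pair for pair, pmi in scores.items() if pmi >= threshold)


# Toy corpus, already split into words; far too small for real statistics.
corpus = [
    ["الذكاء", "الاصطناعي", "جدا"],
    ["انا", "بحب", "الذكاء", "الاصطناعي"],
    ["انا", "بحب", "جدا"],
]
scores = bigram_pmi(corpus)
candidates = merge_candidates(scores, threshold=2.5)  # threshold chosen for the toy data
print(candidates)
```

Pairs that co-occur more often than their individual frequencies predict get a higher score, which is why a phrase such as "الذكاء الاصطناعي" is a natural candidate for a single merged token.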

	## Key Highlights

	- Architecture: Unigram SentencePiece (compatible with `LlamaTokenizer`).
	- Vocab Size: 128,000 tokens.
	- Baked-in Logic: Rules for morphology (prefixes) and identity (names of God and the Prophet) are built into the vocabulary, so no custom code is needed.
	- Dialect-Native: Trained on Egyptian dialogue, songs, and feedback corpora.

	## Performance

	| Model | Fertility (tokens/word) | Total Tokens | Total Words |
	|-------|-------------------------|--------------|-------------|
	| DF-Arc | 1.162 | 133,485 | 114,882 |
	| GPT-4 (cl100k) | 3.689 | 423,743 | 114,882 |
	| AraBERT v2 | 1.555 | 178,609 | 114,882 |
	| AraT5 | 1.193 | 137,107 | 114,882 |
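The fertility figures in the table are consistent with dividing total tokens by total words. The short sketch below reproduces them from the token and word counts listed above; the `fertility` helper and the variable names are introduced here for illustration only.

```python
def fertility(total_tokens: int, total_words: int) -> float:
    """Average number of tokens produced per word."""
    return total_tokens / total_words


TOTAL_WORDS = 114_882  # same evaluation text for every tokenizer in the table

token_counts = {
    "DF-Arc": 133_485,
    "GPT-4 (cl100k)": 423_743,
    "AraBERT v2": 178_609,
    "AraT5": 137_107,
}

computed = {name: round(fertility(tokens, TOTAL_WORDS), 3) for name, tokens in token_counts.items()}

for name, value in computed.items():
    print(f"{name}: {value:.3f}")
```

A fertility of 1.0 would mean exactly one token per word, so lower values in this column mean fewer tokens for the same text.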

	## Usage

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("dataflare/df-arc")
	text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

	print(tokenizer.tokenize(text))
	# Output: ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']
	```
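To check fertility on your own text, something like the following sketch may help. It assumes words are counted by whitespace splitting, which this card does not specify, and `measure_fertility` is a hypothetical helper, not part of `transformers`. So that the block runs without downloading anything, it uses the documented output from the usage example as a stand-in tokenizer.

```python
from typing import Callable, Iterable, List


def measure_fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Tokens per whitespace-delimited word (an assumed word definition)."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


text = "بسم الله الرحمن الرحيم، انا بحب الذكاء الاصطناعي جدا"

# Stand-in for tokenizer.tokenize: the output documented in the usage example.
documented_tokens = ['ب_سم', 'الله', 'ال_رحمن', 'ال_رحيم', '،', 'انا', 'ب_حب', 'ال_ذكاء_ال_اصطناع_ي', 'جدا']

sample_fertility = measure_fertility([text], lambda _text: documented_tokens)
print(sample_fertility)
```

With the real tokenizer loaded as in the usage example, pass `tokenizer.tokenize` as the second argument instead of the stand-in.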

	## Citation

	```bibtex
	@misc{df_arc,
	  title={DF-Arc: The Arabic Token Tax \& Morphology-Aware Tokenization},
	  author={Dataflare Lab},
	  year={2026},
	  publisher={Hugging Face}
	}
	```