NeoAraBERT

NeoAraBERT is a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pretrain NeoAraBERT on diverse open-source and internal datasets covering Modern Standard, classical, and dialectal Arabic. We guided our design choices with Arabic-tailored ablation studies covering text normalization, light stemming, and diacritics-aware tokenization. We also performed POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a synonym-based task, Muradif, that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants rank first in 18 tasks and improve average performance across the full benchmark suite.

This is the NeoAraBERT_Mix checkpoint, our best-performing variant overall. This model was introduced at the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026). For more information, visit our website: https://acr.ps/neoarabert.

The available NeoAraBERT checkpoints:

Model                          Description                                                    Link
NeoAraBERT (NeoAraBERT_Mix)    Trained on both Modern Standard Arabic and Dialectal Arabic.   This repository ✅
NeoAraBERT_MSA                 Trained on Modern Standard Arabic.                             link
NeoAraBERT_DA                  Trained on Dialectal Arabic.                                   link


For detailed benchmarking, see https://acr.ps/neoarabert.

How to Use

Install these libraries:

pip install fast-disambig torch==2.5.1 transformers==4.49.0 xformers==0.0.28.post3

Load the model and use it to generate embeddings:

from transformers import AutoModel, AutoTokenizer

model_name = "U4RASD/NeoAraBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "المركز العربيّ للأبحاث ودراسة السياسات."
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]  # [CLS] token representation
print(embedding.shape)
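A common use of the resulting sentence embeddings is semantic similarity. The sketch below shows the cosine-similarity computation on two embedding tensors of the shape produced above; the tensors here are filled with dummy values (and the 768-dimensional hidden size is an illustrative assumption), since in practice you would take `outputs.last_hidden_state[:, 0, :]` from two separate model calls.

```python
import torch
import torch.nn.functional as F

# Dummy [CLS] embeddings of shape (1, hidden_size); replace with real
# embeddings extracted as shown above. Hidden size 768 is assumed here
# purely for illustration.
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)

# Cosine similarity along the hidden dimension; result has shape (1,)
# with values in [-1, 1], where higher means more similar.
similarity = F.cosine_similarity(emb_a, emb_b)
print(similarity.shape, similarity.item())
```

With real embeddings from the model, sentences that are paraphrases or synonyms (as in the Muradif task) should score higher than unrelated sentences.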

Citation

If you use the code, model, or the Muradif benchmark, please cite:

@inproceedings{abou-chakra-etal-2026-neoarabert,
  title = "{NeoAraBERT}: A Modern Foundation Model for Arabic Embeddings with Diacritics-Aware Tokenization and POS-Targeted Masking",
  author = "Abou Chakra, Chadi and
            Hamoud, Hadi and
            Al Mraikhat, Osama Rakan and
            Abu Obaida, Qusai and
            Ballout, Mohamad and
            Zaraket, Fadi A.",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2026",
  address = "San Diego, California, United States",
  year = "2026",
  note = "Accepted paper",
  url = "https://acr.ps/neoarabert",
  abstract = {We present NeoAraBERT, a state-of-the-art open-source Arabic text-embedding model built on the NeoBERT architecture. We pre-train NeoAraBERT on diverse open-source and internal datasets covering modern standard, classical, and dialectal Arabic. We guided our design choices with Arabic tailored ablation studies including text normalization, light stemming, and diacritics-aware tokenization handling. We also performed more general POS-aware token masking and learning-rate scheduling ablation studies. We benchmarked NeoAraBERT against five top-performing Arabic models on 23 tasks, including a novel synonym-based task, ``Muradif'', that directly assesses embedding quality with no additional fine-tuning. NeoAraBERT variants (MSA, dialectal, and mixed) rank first in 18 tasks, second in two, third in two, and fourth in one task. They show strong performance on classical and modern standard Arabic, substantial margins of improvement ($>$7\%) in two tasks, and a $+$2.75\% improvement on average across all tasks. Our code and links to checkpoints for our model variants are available on our website: \url{https://acr.ps/neoarabert}}
}

License

This model is licensed under the CC BY-SA 4.0 license. The text of the license can be found at https://creativecommons.org/licenses/by-sa/4.0/.

Model size: 0.3B params (Safetensors, F32)