# ShamiBERT
ShamiBERT is a BERT-based language model specialized for Levantine Arabic (الشامية), the dialect group spoken in Syria, Lebanon, Jordan, and Palestine.
## Model Description
ShamiBERT was created by performing continual pre-training on aubmindlab/bert-base-arabertv02-twitter using Masked Language Modeling (MLM) on Levantine Arabic text data.
### Architecture
- Base Model: AraBERTv0.2-Twitter (aubmindlab/bert-base-arabertv02-twitter)
- Architecture: BERT-base (12 layers, 12 attention heads, 768 hidden size)
- Task: Masked Language Modeling (MLM)
- Training: Continual pre-training on Levantine dialect data
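The BERT-base geometry listed above matches the stock `transformers` `BertConfig` defaults, so it can be sanity-checked without downloading any weights (a minimal sketch, assuming `transformers` is installed):

```python
from transformers import BertConfig

# Default BertConfig carries the standard BERT-base geometry
config = BertConfig()
print(config.num_hidden_layers)    # 12 layers
print(config.num_attention_heads)  # 12 attention heads
print(config.hidden_size)          # 768 hidden size
```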
### Why AraBERT-Twitter as the base?
- Pre-trained on 77GB Arabic text + 60M Arabic tweets
- Already handles dialectal Arabic and social media text
- Supports emojis in vocabulary
- Strong foundation for dialect-specific adaptation
## Training Data
ShamiBERT was trained on a combination of Levantine Arabic datasets:
| Dataset | Source | Description |
|---|---|---|
| QCRI Arabic POS (LEV) | HuggingFace | Levantine tweets with POS tags |
| Levanti | HuggingFace | Palestinian/Syrian/Lebanese/Jordanian sentences |
| Curated Shami | Manual | Hand-curated Levantine expressions and phrases |
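The model card does not publish the exact merging script; the following is a minimal sketch of how such datasets might be combined into a single deduplicated pre-training corpus (the helper name and file paths are hypothetical, not from the original):

```python
def build_corpus(files, out_path):
    """Merge plain-text files into one deduplicated pre-training corpus."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in files:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    text = line.strip()
                    # Skip blank lines and exact duplicates across sources
                    if text and text not in seen:
                        seen.add(text)
                        out.write(text + "\n")
```

Usage: `build_corpus(["qcri_lev.txt", "levanti.txt", "curated_shami.txt"], "corpus.txt")`.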
## Training Details
- Epochs: 5
- Learning Rate: 2e-05
- Batch Size: 128 (effective)
- Max Sequence Length: 128
- MLM Probability: 0.15
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-6)
- Weight Decay: 0.01
- Warmup: 10%
- Eval Perplexity: 5.04
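For reference, MLM eval perplexity is simply the exponential of the mean masked-token cross-entropy loss, so the reported perplexity of 5.04 corresponds to an eval loss of roughly 1.62:

```python
import math

eval_loss = 1.617  # approximate mean masked-token cross-entropy
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # ≈ 5.04
```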
## Usage
### Fill-Mask (تعبئة القناع)
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")

# Shami example
results = fill_mask("كيفك [MASK] الحمد لله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")
```
### Feature Extraction (for downstream tasks)
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")

text = "شو أخبارك يا زلمة"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
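As an alternative to the [CLS] vector, mean pooling over non-padding tokens often works well for sentence-level features. A sketch using the attention mask (the `mean_pool` helper is an assumption, not part of the model card):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)        # number of real tokens
    return summed / counts
```

For example: `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.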
### Preprocessing (recommended)
```python
from arabert.preprocess import ArabertPreprocessor

prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيو")
```
## Intended Uses
ShamiBERT is designed for NLP tasks involving Levantine Arabic dialect:
- Sentiment Analysis of Levantine social media
- Text Classification (topic, dialect sub-identification)
- Named Entity Recognition in Shami text
- Feature Extraction for downstream tasks
- Dialect Identification (Levantine vs other Arabic dialects)
## Limitations
- Training data is limited compared to large-scale models like SaudiBERT (26.3GB)
- Performance may vary across sub-dialects (Syrian vs Lebanese vs Jordanian vs Palestinian)
- Based on AraBERT-Twitter, which was pre-trained with max_length=64
- Not suitable for MSA-heavy or non-Levantine dialect tasks
## Citation
```bibtex
@misc{shamibert2026,
  title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
  year={2026},
  note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}
```
## Acknowledgments
- AraBERT team (AUB MIND Lab) for the base model
- ArSyra team for Levantine dialect data
- QCRI for Arabic dialect resources
- Unsloth team for training optimizations