ShamiBERT 🇸🇾🇱🇧🇯🇴🇵🇸

ShamiBERT is a BERT-based language model specialized for Levantine Arabic (الشامية), the dialect spoken in Syria, Lebanon, Jordan, and Palestine.

Model Description

ShamiBERT was created by performing continual pre-training on aubmindlab/bert-base-arabertv02-twitter using Masked Language Modeling (MLM) on Levantine Arabic text data.

Architecture

  • Base Model: AraBERTv0.2-Twitter (aubmindlab/bert-base-arabertv02-twitter)
  • Architecture: BERT-base (12 layers, 12 attention heads, 768 hidden size)
  • Task: Masked Language Modeling (MLM)
  • Training: Continual pre-training on Levantine dialect data

Why AraBERT-Twitter as base?

  1. Pre-trained on 77GB Arabic text + 60M Arabic tweets
  2. Already handles dialectal Arabic and social media text
  3. Supports emojis in vocabulary
  4. Strong foundation for dialect-specific adaptation

Training Data

ShamiBERT was trained on a combination of Levantine Arabic datasets:

Dataset                  Source        Description
QCRI Arabic POS (LEV)    HuggingFace   Levantine tweets with POS tags
Levanti                  HuggingFace   Palestinian/Syrian/Lebanese/Jordanian sentences
Curated Shami            Manual        Hand-curated Levantine expressions and phrases

Training Details

  • Epochs: 5
  • Learning Rate: 2e-05
  • Batch Size: 128 (effective)
  • Max Sequence Length: 128
  • MLM Probability: 0.15
  • Optimizer: AdamW (β₁=0.9, β₂=0.999, ε=1e-6)
  • Weight Decay: 0.01
  • Warmup: 10%
  • Eval Perplexity: 5.04
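The hyperparameters above can be collected into a single config, shown here as a plain dictionary whose key names follow the `transformers.TrainingArguments` convention (the actual training script is not published, so this is an illustrative sketch, not the authors' code). Note also that MLM eval perplexity is simply `exp` of the mean masked-token cross-entropy loss:

```python
import math

# Hyperparameters reported for ShamiBERT's continual pre-training.
# Key names follow transformers.TrainingArguments; illustrative only.
mlm_config = {
    "num_train_epochs": 5,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 128,  # effective batch size
    "max_seq_length": 128,
    "mlm_probability": 0.15,             # fraction of tokens masked
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-6,
    "weight_decay": 0.01,
    "warmup_ratio": 0.10,
}

# Perplexity = exp(eval_loss), so the reported perplexity of 5.04 implies:
eval_loss = math.log(5.04)
print(f"eval loss ≈ {eval_loss:.4f}")  # ≈ 1.6174
```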

Usage

Fill-Mask (تعبئة القناع)

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")

# Shami example: "How are you [MASK], thank God"
results = fill_mask("كيفك [MASK] الحمدلله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")

Feature Extraction (for downstream tasks)

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")

text = "شو أخبارك يا زلمة"  # "What's up, man?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only; no gradients needed
    outputs = model(**inputs)

# Use the [CLS] token embedding for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
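The [CLS] vector is one option; for sentence-level features, mean pooling over the non-padding token embeddings is a common alternative. A minimal sketch (the helper name `mean_pool` is our own, not part of the model or library):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()        # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)     # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)           # (batch, 1), avoid div-by-zero
    return summed / counts
```

Usage with the outputs above would be `sentence_embedding = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.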

Preprocessing (recommended)

from arabert.preprocess import ArabertPreprocessor

prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيي")  # "How are you, bro?"
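ArabertPreprocessor handles the Twitter-specific cleanup the base model expects (URLs, user mentions, emojis). If the `arabert` package is unavailable, the snippet below sketches the kind of basic Arabic normalization such preprocessors typically apply; this is NOT the actual ArabertPreprocessor logic, only an illustration:

```python
import re

def normalize_arabic(text: str) -> str:
    """Minimal Arabic normalization sketch (illustrative, not ArabertPreprocessor)."""
    text = re.sub(r"[\u0623\u0625\u0622]", "\u0627", text)  # normalize أ/إ/آ to bare alef ا
    text = text.replace("\u0640", "")                       # drop tatweel (ـ) elongation
    text = re.sub(r"\s+", " ", text).strip()                # collapse whitespace
    return text

print(normalize_arabic("أهـــلا  وسهلا"))  # -> "اهلا وسهلا"
```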

Intended Uses

ShamiBERT is designed for NLP tasks involving Levantine Arabic dialect:

  • Sentiment Analysis of Levantine social media
  • Text Classification (topic, dialect sub-identification)
  • Named Entity Recognition in Shami text
  • Feature Extraction for downstream tasks
  • Dialect Identification (Levantine vs other Arabic dialects)

Limitations

  • Training data is limited compared to large-scale models like SaudiBERT (26.3GB)
  • Performance may vary across sub-dialects (Syrian vs Lebanese vs Jordanian vs Palestinian)
  • Based on AraBERT-Twitter, which was pre-trained with a maximum sequence length of 64
  • Not suitable for MSA-heavy or non-Levantine dialect tasks

Citation

@misc{shamibert2026,
    title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
    year={2026},
    note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}

Acknowledgments

  • AraBERT team (AUB MIND Lab) for the base model
  • ArSyra team for Levantine dialect data
  • QCRI for Arabic dialect resources
  • Unsloth team for training optimizations