---
license: cc-by-nc-4.0
language:
- fa
tags:
- masked-language-modeling
- feature-extraction
- large-scale-dataset
- Persian
- dataset_size:72.9B
- no-next-sentence-prediction
pipeline_tag: fill-mask
extra_gated_description: >-
  You agree to not use the model to conduct experiments that cause harm to
  human subjects.
extra_gated_fields:
  Full Name: text
  Organization (University): text
  Email address: text
  Country: country
  Could you briefly explain the purpose of using the dataset?: text
  I agree to use this dataset for non-commercial use ONLY: checkbox
---

# Persian Masked Language Model (MLM)

This model is a **Masked Language Model (MLM)** trained on a **72.9-billion-token corpus** of Persian text, making it one of the most extensively pre-trained models dedicated exclusively to the Persian language. It is designed to improve **language understanding** and to provide high-quality contextual embeddings for a wide range of Persian NLP applications.

- **Our Paper:** [Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization](https://arxiv.org/abs/2501.04858)

## Model Details

### Model Description

- **Model Type:** Masked Language Model (MLM)
- **Base Model:** XLM-RoBERTa Large
- **Objective:** Predict randomly masked tokens within input sequences
- **Training Corpus Size:** 72.9 billion tokens
- **Maximum Sequence Length:** 512 tokens
- **Special Feature:** No Next Sentence Prediction (NSP) task
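
As an illustration of the masking objective, the sketch below uses the 🤗 Transformers `DataCollatorForLanguageModeling` to mask tokens dynamically. The 15% masking probability is the library default and an assumption here, not a value reported for this model, and `"your_model_id"` is a placeholder for the checkpoint id:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# "your_model_id" is a placeholder for the actual checkpoint id on the Hub.
tokenizer = AutoTokenizer.from_pretrained("your_model_id")

# Dynamic masking: each call draws a fresh random set of masked positions.
# mlm_probability=0.15 is the library default, assumed here for illustration.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# "The Persian language is one of the Indo-European languages."
encoding = tokenizer("زبان فارسی یکی از زبان‌های هند و اروپایی است.")
batch = collator([{"input_ids": encoding["input_ids"]}])

# "labels" keeps the original ids at masked positions and -100 elsewhere,
# so the MLM loss is computed only over the masked tokens.
print(batch["input_ids"])
print(batch["labels"])
```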

## Training Details

### Training Configuration

- **Hardware:** 8 NVIDIA A800 GPUs
- **Duration:** one week
- **Optimization Framework:** DeepSpeed (ZeRO Stage 0)
- **Training Parameters:**
  - **Learning Rate:** 5e-5
  - **Maximum Sequence Length:** 512 tokens
  - **Precision:** FP16 (mixed precision)
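
For reference, a DeepSpeed configuration consistent with the settings above (ZeRO stage 0, i.e. plain data parallelism, plus FP16) might look like the following. This is an illustrative sketch, not the authors' published configuration file:

```python
# Illustrative DeepSpeed config matching the settings above; not the authors' exact file.
ds_config = {
    "zero_optimization": {"stage": 0},         # stage 0: data parallelism, no state sharding
    "fp16": {"enabled": True},                 # FP16 mixed-precision training
    "train_micro_batch_size_per_gpu": "auto",  # let the HF integration fill these in
    "gradient_accumulation_steps": "auto",
}

# With 🤗 Transformers, such a dict can be passed as TrainingArguments(deepspeed=ds_config).
```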

### Corpus

The model was pre-trained on a large-scale corpus of Persian text collected from diverse sources, ensuring broad language coverage and contextual diversity:

- Web-crawled data
- Academic articles and books
- Persian Wikipedia
- Religious texts
- Social media platforms

The data underwent extensive preprocessing, including deduplication and noise removal, to ensure high-quality training data.
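
The exact preprocessing pipeline is not spelled out here, but exact-duplicate removal is commonly implemented by hashing normalized documents; a minimal sketch of that idea (the function names are illustrative, not from the authors' code):

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace so trivially different copies hash identically.
    return " ".join(text.split())

def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing normalized text (illustrative only)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```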

## Usage

The model can be used for various **downstream NLP tasks** in Persian, including:

- Text classification
- Named entity recognition
- Question answering
- Semantic search
- Contextual embedding generation (see the feature-extraction sketch after the example below)

### Example Usage

The model can be loaded and used with the 🤗 Transformers library. Note that XLM-RoBERTa-based tokenizers use `<mask>` rather than `[MASK]`, so it is safest to insert `tokenizer.mask_token` programmatically:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModelForMaskedLM.from_pretrained("your_model_id")

# Example text, "This is a new [MASK]."; use the tokenizer's own mask token
# (XLM-RoBERTa uses "<mask>", not "[MASK]")
text = f"این یک {tokenizer.mask_token} جدید است."
inputs = tokenizer(text, return_tensors="pt")

# Predict the masked token
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring candidate at the mask position
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```
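
For the contextual-embedding use case listed above, the encoder can be loaded without the MLM head via `AutoModel`. A minimal sketch follows; mean pooling over non-padding tokens is a common default assumed here, not a recommendation documented for this model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# "your_model_id" is a placeholder for the actual checkpoint id.
tokenizer = AutoTokenizer.from_pretrained("your_model_id")
model = AutoModel.from_pretrained("your_model_id")  # encoder only, no MLM head

# "This is a sample sentence." / "Another sentence for comparison."
sentences = ["این یک جملهٔ نمونه است.", "جملهٔ دیگری برای مقایسه."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # [batch, seq_len, hidden_size]

# Mean-pool over non-padding positions to get one vector per sentence.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # e.g. torch.Size([2, 1024]) for a Large encoder
```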

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 30
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 480
- total_eval_batch_size: 64
- optimizer: AdamW (torch) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- num_epochs: 1.0
- mixed_precision_training: Native AMP
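
These values map onto 🤗 `TrainingArguments` roughly as follows. This is a sketch of the correspondence, assuming the standard `Trainer` API, not the authors' actual launch script:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="persian-mlm",       # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=30,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=2,  # 30 x 8 GPUs x 2 steps = 480 effective batch size
    num_train_epochs=1.0,
    lr_scheduler_type="linear",
    fp16=True,                      # Native AMP mixed precision
)
```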

### Framework versions

- Transformers 4.47.0.dev0
- PyTorch 2.4.1+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1

## Citation

If you find this model helpful, please cite the following paper.

**BibTeX:**

```
@misc{hosseinbeigi2025advancingretrievalaugmentedgenerationpersian,
      title={Advancing Retrieval-Augmented Generation for Persian: Development of Language Models, Comprehensive Benchmarks, and Best Practices for Optimization},
      author={Sara Bourbour Hosseinbeigi and Sina Asghari and Mohammad Ali Seif Kashani and Mohammad Hossein Shalchian and Mohammad Amin Abbasi},
      year={2025},
      eprint={2501.04858},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.04858},
}
```