---
language: eo
license: mit
---

# EsperBERTo: A RoBERTa-like model for Esperanto

This is a RoBERTa-like model trained from scratch on the Esperanto language.

## Model description

The model has 6 layers, a hidden size of 768, 12 attention heads, and a total of 84 million parameters. It is based on the RoBERTa architecture. The tokenizer is a byte-level Byte-Pair Encoding (BPE) tokenizer trained from scratch on the same Esperanto corpus; a sketch of how both can be constructed follows the summary list below.

- **Model:** RoBERTa-like
- **Layers:** 6
- **Hidden size:** 768
- **Heads:** 12
- **Parameters:** 84M
- **Tokenizer:** Byte-level BPE
- **Vocabulary size:** 52,000
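
Neither the tokenizer-training code nor the model definition appears on this card, so the following is only a minimal sketch of how a byte-level BPE tokenizer and a model with these dimensions could be built. `min_frequency`, the special-token list, `max_position_embeddings`, and `type_vocab_size` are assumed, typical values rather than facts from this card.

```python
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# Train a byte-level BPE tokenizer from scratch on the Esperanto corpus.
Path("./EsperBERTo").mkdir(exist_ok=True)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar.eo.txt"],
    vocab_size=52_000,
    min_frequency=2,  # assumption, not stated on this card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./EsperBERTo")  # writes vocab.json and merges.txt

# A RoBERTa config matching the numbers above; max_position_embeddings
# and type_vocab_size are assumed standard RoBERTa values.
config = RobertaConfig(
    vocab_size=52_000,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    max_position_embeddings=514,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
print(f"{model.num_parameters():,} parameters")  # roughly 84 million
```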

## Training data

The model was trained on the Esperanto portion of the OSCAR corpus (`oscar.eo.txt`), which is approximately 3 GB of raw text.
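
The preprocessing code is not part of this card; one plausible way to turn the raw corpus into training examples is `LineByLineTextDataset` (deprecated in recent `transformers` releases in favor of the `datasets` library). The `block_size` value and the tokenizer path are assumptions.

```python
from transformers import LineByLineTextDataset, RobertaTokenizerFast

# Load the tokenizer trained in the previous sketch.
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", model_max_length=512)

# One training example per non-empty line, truncated to block_size tokens.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,  # assumption, not stated on this card
)
```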

## Training procedure

The model was trained for one epoch on the OSCAR corpus using the `Trainer` API from the `transformers` library. Training was performed on a single GPU. A sketch assembling the full setup follows the hyperparameter list below.

### Hyperparameters

- `output_dir`: "./EsperBERTo"
- `overwrite_output_dir`: `True`
- `num_train_epochs`: 1
- `per_gpu_train_batch_size`: 64
- `save_steps`: 10_000
- `save_total_limit`: 2
- `prediction_loss_only`: `True`
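
Assembled into a `Trainer` call, the setup might look like the sketch below; it assumes the `model` and `dataset` objects from the earlier sketches, and a standard 15% masking rate. Note that `per_gpu_train_batch_size` has since been renamed `per_device_train_batch_size` in `transformers`.

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Dynamic masking for the MLM objective; 15% is the standard RoBERTa rate.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,  # listed above under its older name
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,               # RobertaForMaskedLM from the config sketch
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,     # LineByLineTextDataset from the data sketch
)
trainer.train()
trainer.save_model("./EsperBERTo")
```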

The final training loss was `6.1178`.

## Evaluation results

The model has not been evaluated on a downstream task. Its capabilities can, however, be inspected with the `fill-mask` pipeline.

Example 1:
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

fill_mask("La suno <mask>.")
```
Output:
```
[{'score': 0.013023526407778263, 'token': 316, 'token_str': ' estas', 'sequence': 'La suno estas.'},
 {'score': 0.008523152209818363, 'token': 607, 'token_str': ' min', 'sequence': 'La suno min.'},
 {'score': 0.007405377924442291, 'token': 2575, 'token_str': ' okuloj', 'sequence': 'La suno okuloj.'},
 {'score': 0.007219308987259865, 'token': 1635, 'token_str': ' tago', 'sequence': 'La suno tago.'},
 {'score': 0.006888304837048054, 'token': 394, 'token_str': ' estis', 'sequence': 'La suno estis.'}]
```

Example 2:
```python
fill_mask("Jen la komenco de bela <mask>.")
```
Output:
```
[{'score': 0.016247423365712166, 'token': 1635, 'token_str': ' tago', 'sequence': 'Jen la komenco de bela tago.'},
 {'score': 0.009718689136207104, 'token': 1021, 'token_str': ' tempo', 'sequence': 'Jen la komenco de bela tempo.'},
 {'score': 0.007543196901679039, 'token': 2257, 'token_str': ' kongreso', 'sequence': 'Jen la komenco de bela kongreso.'},
 {'score': 0.0071307034231722355, 'token': 1161, 'token_str': ' vivo', 'sequence': 'Jen la komenco de bela vivo.'},
 {'score': 0.006644904613494873, 'token': 758, 'token_str': ' jaroj', 'sequence': 'Jen la komenco de bela jaroj.'}]
```

## Intended uses & limitations

This model is intended as a general-purpose language model for Esperanto. It can be used directly for masked language modeling and can be fine-tuned for downstream tasks such as:
- Text Classification
- Token Classification (Part-of-Speech Tagging, Named Entity Recognition)
- Question Answering

Since the model was trained on a relatively small dataset, its performance may be limited. For better results on a specific task, fine-tuning on a relevant labeled dataset is recommended; a minimal starting point is sketched below.
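
For example, loading the pretrained weights for a sequence-classification fine-tune might look like this sketch (`num_labels=2` is an arbitrary placeholder; a labeled Esperanto dataset and a training loop are still needed):

```python
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo")
# The pretrained MLM head is discarded and a new, randomly initialized
# classification head is attached on top of the encoder.
model = RobertaForSequenceClassification.from_pretrained(
    "./EsperBERTo", num_labels=2  # placeholder label count
)
```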