---
language: eo
license: mit
---

# EsperBERTo: A RoBERTa-like model for Esperanto

This is a RoBERTa-like model trained from scratch on the Esperanto language.

## Model description

The model has 6 layers, a hidden size of 768, 12 attention heads, and a total of 84 million parameters. It is based on the RoBERTa architecture. The tokenizer is a byte-level Byte-Pair Encoding (BPE) tokenizer trained from scratch on the same Esperanto corpus; a sketch of how both can be constructed follows the summary list below.

- **Model:** RoBERTa-like
- **Layers:** 6
- **Hidden size:** 768
- **Heads:** 12
- **Parameters:** 84M
- **Tokenizer:** Byte-level BPE
- **Vocabulary size:** 52,000
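
Neither the tokenizer-training code nor the model definition appears on this card, so the following is only a minimal sketch of how a byte-level BPE tokenizer and a model with these dimensions could be built. `min_frequency`, the special-token list, `max_position_embeddings`, and `type_vocab_size` are assumed, typical values rather than facts from this card.

```python
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# Train a byte-level BPE tokenizer from scratch on the Esperanto corpus.
Path("./EsperBERTo").mkdir(exist_ok=True)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar.eo.txt"],
    vocab_size=52_000,
    min_frequency=2,  # assumption, not stated on this card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./EsperBERTo")  # writes vocab.json and merges.txt

# A RoBERTa config matching the numbers above; max_position_embeddings
# and type_vocab_size are assumed standard RoBERTa values.
config = RobertaConfig(
    vocab_size=52_000,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    max_position_embeddings=514,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)
print(f"{model.num_parameters():,} parameters")  # roughly 84 million
```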

## Training data

The model was trained on the Esperanto portion of the OSCAR corpus (`oscar.eo.txt`), which is approximately 3 GB of raw text.
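
The preprocessing code is not part of this card; one plausible way to turn the raw corpus into training examples is `LineByLineTextDataset` (deprecated in recent `transformers` releases in favor of the `datasets` library). The `block_size` value and the tokenizer path are assumptions.

```python
from transformers import LineByLineTextDataset, RobertaTokenizerFast

# Load the tokenizer trained in the previous sketch.
tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo", model_max_length=512)

# One training example per non-empty line, truncated to block_size tokens.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./oscar.eo.txt",
    block_size=128,  # assumption, not stated on this card
)
```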

## Training procedure

The model was trained for one epoch on the OSCAR corpus using the `Trainer` API from the `transformers` library. Training was performed on a single GPU. A sketch assembling the full setup follows the hyperparameter list below.

### Hyperparameters

- `output_dir`: "./EsperBERTo"
- `overwrite_output_dir`: `True`
- `num_train_epochs`: 1
- `per_gpu_train_batch_size`: 64
- `save_steps`: 10_000
- `save_total_limit`: 2
- `prediction_loss_only`: `True`
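
Assembled into a `Trainer` call, the setup might look like the sketch below; it assumes the `model` and `dataset` objects from the earlier sketches, and a standard 15% masking rate. Note that `per_gpu_train_batch_size` has since been renamed `per_device_train_batch_size` in `transformers`.

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Dynamic masking for the MLM objective; 15% is the standard RoBERTa rate.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./EsperBERTo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,  # listed above under its older name
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,               # RobertaForMaskedLM from the config sketch
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,     # LineByLineTextDataset from the data sketch
)
trainer.train()
trainer.save_model("./EsperBERTo")
```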

The final training loss was `6.1178`.

## Evaluation results

The model has not been evaluated on a downstream task. Its capabilities can, however, be inspected with the `fill-mask` pipeline.

Example 1:
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./EsperBERTo",
    tokenizer="./EsperBERTo"
)

fill_mask("La suno <mask>.")
```
Output:
```
[{'score': 0.013023526407778263, 'token': 316, 'token_str': ' estas', 'sequence': 'La suno estas.'},
 {'score': 0.008523152209818363, 'token': 607, 'token_str': ' min', 'sequence': 'La suno min.'},
 {'score': 0.007405377924442291, 'token': 2575, 'token_str': ' okuloj', 'sequence': 'La suno okuloj.'},
 {'score': 0.007219308987259865, 'token': 1635, 'token_str': ' tago', 'sequence': 'La suno tago.'},
 {'score': 0.006888304837048054, 'token': 394, 'token_str': ' estis', 'sequence': 'La suno estis.'}]
```

Example 2:
```python
fill_mask("Jen la komenco de bela <mask>.")
```
Output:
```
[{'score': 0.016247423365712166, 'token': 1635, 'token_str': ' tago', 'sequence': 'Jen la komenco de bela tago.'},
 {'score': 0.009718689136207104, 'token': 1021, 'token_str': ' tempo', 'sequence': 'Jen la komenco de bela tempo.'},
 {'score': 0.007543196901679039, 'token': 2257, 'token_str': ' kongreso', 'sequence': 'Jen la komenco de bela kongreso.'},
 {'score': 0.0071307034231722355, 'token': 1161, 'token_str': ' vivo', 'sequence': 'Jen la komenco de bela vivo.'},
 {'score': 0.006644904613494873, 'token': 758, 'token_str': ' jaroj', 'sequence': 'Jen la komenco de bela jaroj.'}]
```

## Intended uses & limitations

This model is intended as a general-purpose language model for Esperanto. It can be used directly for masked language modeling and can be fine-tuned for downstream tasks such as:
- Text Classification
- Token Classification (Part-of-Speech Tagging, Named Entity Recognition)
- Question Answering

Since the model was trained on a relatively small dataset, its performance may be limited. For better results on a specific task, fine-tuning on a relevant labeled dataset is recommended; a minimal starting point is sketched below.
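
For example, loading the pretrained weights for a sequence-classification fine-tune might look like this sketch (`num_labels=2` is an arbitrary placeholder; a labeled Esperanto dataset and a training loop are still needed):

```python
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./EsperBERTo")
# The pretrained MLM head is discarded and a new, randomly initialized
# classification head is attached on top of the encoder.
model = RobertaForSequenceClassification.from_pretrained(
    "./EsperBERTo", num_labels=2  # placeholder label count
)
```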