SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
Paper: arXiv:2309.07445
How to use Davlan/afro-xlmr-large-76L_script with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("fill-mask", model="Davlan/afro-xlmr-large-76L_script")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("Davlan/afro-xlmr-large-76L_script")
model = AutoModelForMaskedLM.from_pretrained("Davlan/afro-xlmr-large-76L_script")

AfroXLMR-large was created by first augmenting the XLM-R-large model with missing scripts (N'Ko and Tifinagh), followed by an MLM adaptation of the expanded XLM-R-large model on 76 languages widely spoken in Africa, including 4 high-resource languages.
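The MLM adaptation mentioned above follows the standard masked-language-modeling recipe used for BERT/XLM-R-style models. As a rough, self-contained sketch of how such training inputs are prepared (the 15% masking rate, the 80/10/10 replacement split, and the token IDs here are the conventional defaults, not values taken from this model card):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size,
                mlm_prob=0.15, special_ids=(0, 1, 2)):
    """Prepare (masked inputs, labels) for MLM, BERT/XLM-R style."""
    labels = input_ids.clone()
    prob = torch.full(labels.shape, mlm_prob)
    # Never mask special tokens (pad/bos/eos IDs are illustrative)
    prob[torch.isin(input_ids, torch.tensor(special_ids))] = 0.0
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100  # loss is computed only on masked positions
    # Of the masked positions: 80% -> <mask> token
    replace = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[replace] = mask_token_id
    # 10% -> a random token; the remaining 10% stay unchanged
    rand = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~replace
    input_ids[rand] = torch.randint(vocab_size, labels.shape)[rand]
    return input_ids, labels

torch.manual_seed(0)
ids = torch.randint(5, 1000, (2, 64))  # toy token IDs, avoiding "specials"
inputs, labels = mask_tokens(ids.clone(), mask_token_id=250001, vocab_size=250002)
```

During adaptation, the model receives `inputs` and is trained to recover the original tokens at the positions where `labels` is not -100; `transformers.DataCollatorForLanguageModeling` implements the same scheme.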
The adaptation corpus was a mix of mC4, Wikipedia, and OPUS data.
There are 75 languages available:
@misc{adelani2023sib200,
title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects},
author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
year={2023},
eprint={2309.07445},
archivePrefix={arXiv},
primaryClass={cs.CL}
}