---
tags:
- espnet
- audio
- automatic-speech-recognition
language: et
license: cc-by-4.0
---

# Estonian ESPnet2 ASR model

## Model description

This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech (Tallinn University of Technology).

## Intended uses & limitations

This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, and talks.
## How to use

Decoding with `Speech2Text.from_pretrained` downloads the model from the Hugging Face Hub and requires both the `espnet` and `espnet_model_zoo` packages to be installed.

```python
from espnet2.bin.asr_inference import Speech2Text
import soundfile

# Load the pretrained model (with its language model) from the Hugging Face Hub
model = Speech2Text.from_pretrained(
    "TalTechNLP/espnet2_estonian",
    lm_weight=0.6, ctc_weight=0.4, beam_size=60
)

# Read a sound file; the model expects 16 kHz audio
speech, rate = soundfile.read("speech.wav")
assert rate == 16000

# Each n-best hypothesis is a (text, tokens, token_ids, hyp) tuple;
# take the transcript of the best hypothesis
text, *_ = model(speech)[0]
print(text)
```
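
If your recordings are not already at 16 kHz, resample them before decoding. A minimal sketch using the third-party `librosa` package (the input file name is hypothetical; any resampling tool works):

```python
import librosa

# librosa.load resamples to the requested rate and returns mono float audio
speech, rate = librosa.load("speech_44k.wav", sr=16000)
```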

#### Limitations and bias

Since this model was trained mostly on broadcast speech and texts from the web, it may have problems correctly decoding the following:
* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded in very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech

## Training data

Acoustic training data:

| Type                  | Amount (h) |
|-----------------------|:----------:|
| Broadcast speech      | 591        |
| Spontaneous speech    | 53         |
| Elderly speech corpus | 53         |
| Talks, lectures       | 49         |
| Parliament speeches   | 31         |
| *Total*               | *761*      |

Language model training data:
* Estonian National Corpus 2019
* OpenSubtitles
* Speech transcripts
|
| | ## Training procedure |
| |
|
| | Standard EspNet2 Conformer recipe. |

## Evaluation results

### WER

All results below use the `decode_asr_lm_lm_large_valid.loss.ave_5best_asr_model_valid.acc.ave` decoding configuration. Snt and Wrd are the number of sentences and words in each set; Corr, Sub, Del, Ins, Err, and S.Err are the percentages of correct words, substitutions, deletions, insertions, word errors, and sentence errors.

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|aktuaalne2021.testset|2864|56575|93.1|4.5|2.4|2.0|8.9|63.4|
|jutusaated.devset|273|4677|93.9|3.6|2.4|1.2|7.3|46.5|
|jutusaated.testset|818|11093|94.7|2.7|2.5|0.9|6.2|45.0|
|www-trans.devset|1207|13865|82.3|8.5|9.3|3.4|21.2|74.1|
|www-trans.testset|1648|22707|86.4|7.6|6.0|2.5|16.1|75.7|
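
To score the model's output against your own reference transcripts, you can compute WER with the third-party `jiwer` package. A minimal sketch (this is not how the numbers above were produced, and the sentences are made up for illustration):

```python
import jiwer

reference = "tere tulemast eestikeelsesse kõnetuvastusse"
hypothesis = "tere tulemast eesti keelsesse kõnetuvastusse"

# WER = (substitutions + deletions + insertions) / number of reference words
print(jiwer.wer(reference, hypothesis))  # 0.5: one substitution + one insertion over 4 words
```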

### BibTeX entry and citation info

#### Citing ESPnet

```bibtex
@inproceedings{watanabe2018espnet,
  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson {Enrique Yalta Soplin} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
  title={{ESPnet}: End-to-End Speech Processing Toolkit},
  year={2018},
  booktitle={Proceedings of Interspeech},
  pages={2207--2211},
  doi={10.21437/Interspeech.2018-1456},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
}
```