This is an Icefall Zipformer streaming speech-to-text model for Estonian. A larger version of this model is available here: https://huggingface.co/TalTechNLP/streaming-zipformer-large.et-en
Use via sherpa-onnx:
sherpa-onnx --encoder=encoder.onnx --decoder=decoder.onnx --joiner=joiner.onnx --tokens=tokens.txt audio.wav
Or using the int8 quantized version:
sherpa-onnx --encoder=encoder.int8.onnx --decoder=decoder.int8.onnx --joiner=joiner.int8.onnx --tokens=tokens.txt audio.wav
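If you want to switch between the two variants from a script, the commands above can be wrapped in a small helper. The following is only a sketch: the names `build_command` and `transcribe` and the `model_dir` parameter are hypothetical, and it assumes the `sherpa-onnx` binary is on your `PATH` and the model files use the filenames shown above.

```python
import shlex
import subprocess
from pathlib import Path


def build_command(wav_path, model_dir=".", quantized=False):
    """Build the sherpa-onnx argument list for one WAV file.

    Hypothetical helper: it only reproduces the flags shown in the
    commands above and selects the int8 files when quantized=True.
    """
    suffix = ".int8" if quantized else ""
    model_dir = Path(model_dir)

    def model_file(name):
        # Resolve a model filename relative to the model directory.
        return str(model_dir / name)

    return [
        "sherpa-onnx",
        "--encoder=" + model_file("encoder" + suffix + ".onnx"),
        "--decoder=" + model_file("decoder" + suffix + ".onnx"),
        "--joiner=" + model_file("joiner" + suffix + ".onnx"),
        "--tokens=" + model_file("tokens.txt"),
        str(wav_path),
    ]


def transcribe(wav_path, model_dir=".", quantized=False):
    """Run sherpa-onnx and return the completed process.

    Both stdout and stderr are captured, because this sketch makes no
    assumption about which stream carries the recognized text.
    """
    cmd = build_command(wav_path, model_dir, quantized)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


if __name__ == "__main__":
    # Print the int8 command without running it.
    print(shlex.join(build_command("audio.wav", quantized=True)))
```

Calling `build_command("audio.wav")` yields the same arguments as the first command above, and `quantized=True` yields the int8 variant.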
This model is used in the web-based Estonian ASR application https://eestiasr.vercel.app/.
The model was trained on roughly 1334 hours of manually transcribed Estonian speech from the TalTech Estonian Speech Dataset 1.0. Training also used around 4000 hours of automatically transcribed Estonian public TV (ETV) data, consisting of news and talk shows, and a 500-hour subset of the Gigaspeech dataset, which includes YouTube videos and podcasts. To transcribe the Estonian TV data automatically, we used Whisper large-v3-turbo, fine-tuned on the TalTech Estonian Speech Dataset 1.0. The Gigaspeech subset was mixed with the Estonian data to improve the model's ability to transcribe English terms and expressions that are often embedded in Estonian sentences, especially in technology-related domains. Since the original Gigaspeech transcripts are uppercase and unpunctuated, we retranscribed the 500-hour subset using Whisper large-v3-turbo. The ASR model produces properly capitalized and punctuated text.
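To get a rough sense of the training mixture, the approximate hour counts quoted above can be summed and turned into shares. This is only illustrative arithmetic: the figures are the rounded values from the text, and the dictionary keys are labels chosen for this sketch.

```python
# Approximate hours per training source, as quoted in the description above.
hours = {
    "TalTech Estonian Speech Dataset 1.0 (manual transcripts)": 1334,
    "ETV news and talk shows (automatic transcripts)": 4000,
    "Gigaspeech subset (retranscribed)": 500,
}

total_hours = sum(hours.values())

# Share of each source as a percentage of the total, rounded to one decimal.
shares = {name: round(100 * h / total_hours, 1) for name, h in hours.items()}

print(f"Total: about {total_hours} hours")
for name, share in shares.items():
    print(f"{name}: about {share}%")
```

Under these rounded figures, the automatically transcribed ETV data makes up roughly two thirds of the training audio.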