Papers
arxiv:2605.03073

The TTS-STT Flywheel: Synthetic Entity-Dense Audio Closes the Indic ASR Gap Where Commercial and Open-Source Systems Fail

Published on May 4
Submitted by Venkata Pushpak Teja Menta on May 6
Abstract

A self-contained text-to-speech↔speech-to-text flywheel significantly improves niche-domain Indic automatic speech recognition through synthetic entity-dense data generation and low-resource fine-tuning.

AI-generated summary

Niche-domain Indic ASR -- digit strings, currency amounts, addresses, brand names, English/Indic codemix -- is under-served by both open-source SOTA and commercial systems. On a synthesised entity-dense Telugu test set (held out by synthesis system), vasista22/whisper-telugu-large-v2 (open SOTA) achieves an Entity-Hit-Rate (EHR) of 0.027 and Deepgram Nova-3 (commercial) 0.16. We close this gap with a self-contained TTS↔STT flywheel: an open-source Indic TTS pipeline synthesises ~22,000 entity-dense Indic-English code-mix utterances at <$50 marginal cost, and a LoRA fine-tune on top of vasista22 achieves EHR 0.473 on the held-out test (17× over open SOTA, 3× over commercial), with read-prose regression bounded to +6.6 pp WER on FLEURS-Te. Cross-language: β-Hi 0.337 (7× vs vasista22) and β-Ta 0.543 (22× vs vasista22, 22× vs Deepgram); on Hindi, where Deepgram has substantial entity coverage, the flywheel underperforms commercial. All three β models fall below pre-registered EHR targets (0.75 for Te, 0.65 for Hi/Ta); we report this honestly. A native-human-recorded sanity check (n=20 Telugu) confirms transfer to real speech (β-Te EHR 0.516 on native vs 0.473 on synth). An EDSA-isolation ablation (LoRA on FLEURS-Te alone) yields EHR 0.020 on the same held-out set, attributing ~100% of the gain to the EDSA corpus. We additionally report a language-conditional finding: vanilla Whisper-large-v3 exhibits Telugu-specific Script Collapse (SFR 0.46-0.71) that a per-language LoRA corrects (SFR 0.81-0.97), but the recipe is contraindicated on Hindi and Tamil, where vanilla SFR ≥ 0.98. Code, holdouts, predictions, EDSA corpus, and entity dictionaries are released open-source.

Community

Paper submitter

We benchmark open-source SOTA (vasista22/whisper-{te,ta,hi}-large-v2) and commercial Deepgram Nova-3 on a synthesised entity-dense Telugu test set — content that real Indian users actually speak: digit strings, currency amounts, addresses, brand names, English/Indic codemix. Open-source SOTA gets EHR 0.027. Commercial Deepgram gets 0.16. Both are an order of magnitude below their own read-prose performance.
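For concreteness, an Entity-Hit-Rate style score could be computed as below. This is a hedged sketch: the paper does not spell out its matching rules here, so the verbatim-substring matching (no text normalisation, no fuzzy digit matching) and the function name are illustrative assumptions.

```python
def entity_hit_rate(entity_lists, hypotheses):
    """Fraction of reference entities recovered verbatim in the hypothesis.

    Simplified stand-in for the paper's EHR metric; real matching rules
    (normalisation, numeral handling) may differ.
    """
    hits = total = 0
    for entities, hyp in zip(entity_lists, hypotheses):
        for entity in entities:
            total += 1
            hits += entity in hyp  # verbatim substring match
    return hits / total if total else 0.0

# Toy example: 2 of 3 entities recovered across two utterances.
score = entity_hit_rate(
    [["₹500", "Hyderabad"], ["9876543210"]],
    ["send ₹500 to Hyderabad today", "call nine eight seven"],
)
print(round(score, 3))  # -> 0.667
```

Under this definition, an EHR of 0.027 means open SOTA recovers roughly 1 in 37 reference entities.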

We close the gap with a self-contained TTS↔STT flywheel: an open-source Indic TTS pipeline (Praxy R6 / vanilla Chatterbox / IndicF5 / ElevenLabs / Cartesia) synthesises ~22k entity-dense Indic-English utterances at <$50 marginal cost, and a LoRA on top of vasista22 reaches EHR 0.473 on Telugu (17× over open SOTA, 3× over commercial), 0.337 on Hindi, 0.543 on Tamil. Two of three languages beat commercial Deepgram. Native human-recorded sanity check confirms transfer: 0.516 EHR on real Telugu speech vs 0.473 on synth.
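One way to picture the entity-dense synthesis step is template filling from entity dictionaries before handing text to TTS. The templates, dictionary contents, and sampling scheme below are hypothetical, not the paper's pipeline:

```python
import random

# Hypothetical entity dictionaries (the paper releases real ones).
ENTITY_DICTS = {
    "amount": ["₹1,250", "₹99", "₹40,000"],
    "brand": ["Flipkart", "Paytm", "Jio"],
    "digits": ["9876543210", "500081"],
}

# Telugu-English code-mix templates with entity slots (illustrative).
TEMPLATES = [
    "నేను {brand} లో {amount} pay చేశాను",
    "నా number {digits}, {brand} delivery కోసం",
]

def synth_text(rng=random):
    """Sample one entity-dense code-mix utterance for TTS synthesis."""
    template = rng.choice(TEMPLATES)
    fills = {k: rng.choice(v) for k, v in ENTITY_DICTS.items()}
    return template.format(**fills)  # unused slots are simply ignored

print(synth_text())
```

Repeating this ~22k times and passing the text through the TTS pipeline yields paired (audio, transcript) training data whose transcripts are entity-exact by construction.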

Honest reporting throughout: all three β models miss the pre-registered EHR target (0.75 Te, 0.65 Hi/Ta), Hindi underperforms commercial (Deepgram has invested there), and the secondary "fix Whisper-v3 Telugu Script Collapse via per-language LoRA" recipe is contraindicated on Hindi/Tamil where vanilla SFR ≥ 0.98. An EDSA-isolation ablation (LoRA on FLEURS-Te alone → EHR 0.020) attributes ~100% of the gain to the entity-dense corpus, not the LoRA process.
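A Script-Fidelity-Rate style check can be sketched as the share of alphabetic output characters that land in the expected Unicode block (Telugu: U+0C00 to U+0C7F). The exact SFR definition in the paper may differ; this is an illustrative approximation:

```python
def script_fidelity(text, lo=0x0C00, hi=0x0C7F):
    """Fraction of alphabetic characters inside the expected script block.

    Rough proxy for SFR: Script Collapse shows up as Telugu audio being
    transcribed into Latin (or another) script, driving this toward 0.
    """
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(lo <= ord(c) <= hi for c in letters)
    return in_script / len(letters)

print(script_fidelity("తెలుగు"))       # all Telugu letters -> 1.0
print(script_fidelity("telugu text"))  # Latin-script collapse -> 0.0
```

By this measure, vanilla Whisper-v3's Telugu SFR of 0.46-0.71 means roughly a third to half of its output characters fall outside Telugu script.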

Code, holdouts, predictions, EDSA corpus, and entity dictionaries open-source. Six LoRA adapters released on HF (te/hi/ta × {rb on vasista22, r2 on Whisper-v3}). Companion to arXiv:2604.25441 (Praxy Voice TTS), arXiv:2604.25476 (PSP), arXiv:2605.00777 (LASE).
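The "LoRA fine-tune" in the recipe above amounts to training a low-rank additive update to frozen weights. A self-contained numeric sketch of the idea, with shapes and scaling chosen for illustration (not the paper's hyperparameters):

```python
import numpy as np

d, r, alpha = 1280, 32, 64           # hidden size, LoRA rank, scaling factor
W = np.random.randn(d, d)            # frozen pretrained weight matrix
A = np.zeros((r, d))                 # A starts at zero, so the delta starts at zero
B = np.random.randn(d, r) * 0.01     # only A and B are trained

# Effective weight used at inference: W + (alpha / r) * B @ A
W_eff = W + (alpha / r) * (B @ A)

# Trainable parameters shrink from d*d to 2*d*r.
print(2 * d * r / (d * d))           # -> 0.05
```

Because only the adapters train, a per-language fine-tune is cheap and the result ships as a small adapter on top of the base checkpoint, which is how the six released te/hi/ta adapters can be distributed separately from vasista22 and Whisper-v3.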

