metadata
title: Audiobook Generator - English to 36 Languages
emoji: 📖
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.25.0
app_file: app.py
pinned: false
license: mit

📖 Audiobook Generator - English to 36 Languages

Paste or upload English text and generate a professionally narrated audiobook in any of 36 languages, powered by Alibaba's Qwen3.5-Omni-Plus.

Features

  • Translation + Narration: Translates English text and generates expressive speech in the target language
  • Direct Narration: Generate English audiobooks without translation
  • 29 narrator voices: Male and female voices with different styles (cinematic, warm, dramatic, etc.)
  • Smart text splitting: Handles long texts by splitting at sentence/paragraph boundaries
  • MP3 output: Compressed for easy download and sharing
  • Section pauses: Optional natural pauses between text sections
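
As a rough illustration of the smart text splitting feature, the sketch below packs sentences greedily into chunks of bounded size. The function name `split_text`, the boundary regex, and the 1500-character default are assumptions made for this example; the actual logic in `app.py` may differ.

```python
import re

# Split after ., ! or ? followed by whitespace, or at a blank line (paragraph break).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+|\n\s*\n")


def split_text(text, max_chars=1500):
    """Split text into chunks of at most max_chars, breaking at sentence/paragraph boundaries."""
    pieces = [p.strip() for p in _BOUNDARY.split(text) if p and p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        # Last resort: a single sentence longer than the limit is hard-wrapped.
        while len(piece) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:max_chars])
            piece = piece[max_chars:].lstrip()
        if not piece:
            continue
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars:
            # The sentence still fits in the chunk being built.
            current = candidate
        else:
            # Close the current chunk and start a new one with this sentence.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Sentences are rejoined with a single space, so paragraph breaks inside a chunk are not preserved. The hard-wrap fallback can cut mid-word, which a production splitter would likely want to avoid.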

Setup

  1. Add your DashScope API key as a Space Secret:
    • Settings → Secrets → New Secret
    • Name: DASHSCOPE_API_KEY
    • Value: your key (get one here)
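
Space Secrets are exposed to the running app as environment variables, so the key can be read with `os.environ`. The helper below is a minimal sketch: `get_api_key` is a hypothetical name, not necessarily what `app.py` uses.

```python
import os


def get_api_key(env=os.environ):
    """Return the DashScope API key from the environment, or raise a clear error."""
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set. Add it as a Space Secret under Settings > Secrets."
        )
    return key
```

Failing early with an explicit message makes a missing secret easier to diagnose than an authentication error from the first API call.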

Supported Languages (36)

⭐ Core Languages (Best Quality)

English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian

Extended Languages

Arabic, Bengali, Cantonese, Czech, Danish, Dutch, Filipino, Finnish, Greek, Hebrew, Hindi, Hungarian, Indonesian, Malay, Norwegian, Persian, Polish, Romanian, Swahili, Swedish, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese

How It Works

  1. Your English text is split into manageable chunks at sentence or paragraph boundaries
  2. Each chunk is sent to qwen3.5-omni-plus with instructions to translate (if needed) and narrate
  3. The model generates expressive speech with audiobook-quality narration
  4. All audio chunks are concatenated and converted to MP3
  5. Download your audiobook!
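
Step 4 can be sketched with the standard library alone, under stated assumptions: each chunk arrives as raw 16-bit mono PCM at 24 kHz, and every chunk is treated as its own section when pauses are enabled. The names `silence` and `join_chunks` are hypothetical. The sketch stops at WAV output; the MP3 conversion would need an external encoder and is not shown.

```python
import io
import wave

SAMPLE_RATE = 24000  # assumed sample rate in Hz
SAMPLE_WIDTH = 2     # assumed 16-bit samples (2 bytes each)
CHANNELS = 1         # assumed mono audio


def silence(seconds):
    """Return raw PCM bytes of digital silence for the given duration."""
    n_frames = int(SAMPLE_RATE * seconds)
    return b"\x00" * (n_frames * SAMPLE_WIDTH * CHANNELS)


def join_chunks(pcm_chunks, pause_seconds=0.0):
    """Concatenate raw PCM chunks, with an optional pause between them, into WAV bytes."""
    gap = silence(pause_seconds) if pause_seconds > 0 else b""
    pcm = gap.join(pcm_chunks)  # the pause is inserted only between chunks
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()
```

Calling `join_chunks(chunks, pause_seconds=0.5)` yields a single WAV byte string that can be written to disk or passed to an encoder.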

Limitations

  • Processing time: ~30-60 seconds per ~1500 characters
  • Voice quality in the extended languages may be less consistent than in the 10 core languages
  • Very long texts (100k+ characters) can take on the order of an hour at the rate above, assuming chunks are processed one after another
  • The model generates speech at its own pace, so timing won't match a human narrator exactly
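
The processing-time figure above can be turned into a back-of-the-envelope estimate. This sketch assumes chunks of about 1500 characters processed sequentially at 30 to 60 seconds each; `estimate_seconds` is a hypothetical helper, and real timings will vary with API load.

```python
import math


def estimate_seconds(n_chars, chunk_chars=1500, low=30, high=60):
    """Return a (low, high) processing-time estimate in seconds for n_chars of text."""
    n_chunks = math.ceil(n_chars / chunk_chars)
    return n_chunks * low, n_chunks * high
```

For example, `estimate_seconds(3000)` returns `(60, 120)`, that is, one to two minutes for two chunks.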

Credits