Try Orpheus TTS here
Transcribe audio files to text with language detection
Generate speech from text with selectable voices