# Streaming Speech Translation Pipeline

Real-time English → Russian speech translation: Audio In → ASR → NMT → TTS → Audio Out.
Translates spoken English into spoken Russian with streaming output over WebSocket.
## Architecture

```
Audio Input → ASR (ONNX)      → NMT (GGUF)      → TTS (ONNX) → Audio Output
  (PCM16)     Conformer RNN-T   TranslateGemma    XTTSv2        (PCM16)
```

- **ASR:** NVIDIA NeMo Conformer RNN-T (cache-aware streaming, ONNX)
- **NMT:** TranslateGemma 4B (GGUF Q8_0, llama-cpp-python) with streaming segmentation and translation merging
- **TTS:** XTTSv2 with GPT-2 AR model + HiFi-GAN vocoder (ONNX), 24 kHz output

See ARCHITECTURE.md for detailed design documentation.
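The stage-per-thread layout above can be sketched with bounded queues providing backpressure between stages (compare the `--*-queue-max` flags below). This is a minimal illustration with dummy transforms standing in for the real models; the names here are illustrative, not the project's actual API — see `src/pipeline/orchestrator.py` for the real `PipelineOrchestrator`.

```python
import queue
import threading

def run_stage(name, process, inbox, outbox, sentinel=None):
    """Generic pipeline stage: pull from inbox, transform, push downstream."""
    while True:
        item = inbox.get()
        if item is sentinel:          # propagate shutdown to the next stage
            outbox.put(sentinel)
            return
        outbox.put(process(item))

# Bounded queues give backpressure between stages.
audio_q = queue.Queue(maxsize=256)    # raw PCM16 chunks
text_q = queue.Queue(maxsize=64)      # ASR transcripts
tts_q = queue.Queue(maxsize=16)       # translated segments
out_q = queue.Queue(maxsize=32)       # synthesized PCM16

# Dummy transforms standing in for ASR / NMT / TTS inference.
stages = [
    ("asr", lambda pcm: f"transcript({pcm})", audio_q, text_q),
    ("nmt", lambda txt: f"translation({txt})", text_q, tts_q),
    ("tts", lambda txt: f"audio({txt})", tts_q, out_q),
]
threads = [threading.Thread(target=run_stage, args=s, daemon=True) for s in stages]
for t in threads:
    t.start()

audio_q.put("chunk0")
audio_q.put(None)                     # sentinel: drain and stop all stages
results = []
while (item := out_q.get()) is not None:
    results.append(item)
print(results)  # ['audio(translation(transcript(chunk0)))']
```

Because every queue is bounded, a slow downstream stage (typically TTS) naturally throttles the stages feeding it instead of letting buffers grow without limit.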
## Requirements

- Python 3.10+
- Model files:
  - ASR: NeMo Conformer RNN-T ONNX model directory
  - NMT: TranslateGemma 4B GGUF file
  - TTS: XTTSv2 ONNX model directory, BPE vocab, mel normalization stats, reference audio
## Installation

```shell
pip install -r requirements.txt
```

### System Dependencies

```shell
# Ubuntu/Debian
apt-get install libsndfile1 libportaudio2
```
## Usage

### Start the Server

```shell
python app.py \
  --asr-onnx-path /path/to/asr_onnx_model/ \
  --nmt-gguf-path /path/to/translategemma-4b-it-q8_0.gguf \
  --tts-model-dir /path/to/xtts_onnx_models/ \
  --tts-vocab-path /path/to/vocab.json \
  --tts-mel-norms-path /path/to/mel_stats.npy \
  --tts-ref-audio-path /path/to/reference_speaker.wav \
  --host 0.0.0.0 \
  --port 8765
```
### CLI Options

| Flag | Default | Description |
|---|---|---|
| `--asr-onnx-path` | (required) | ASR ONNX model directory |
| `--asr-chunk-ms` | 10 | ASR audio chunk duration (ms) |
| `--asr-sample-rate` | 16000 | ASR expected sample rate (Hz) |
| `--nmt-gguf-path` | (required) | NMT GGUF model file |
| `--nmt-n-threads` | 4 | NMT CPU threads |
| `--tts-model-dir` | (required) | TTS ONNX model directory |
| `--tts-vocab-path` | (required) | TTS BPE `vocab.json` |
| `--tts-mel-norms-path` | (required) | TTS `mel_stats.npy` |
| `--tts-ref-audio-path` | (required) | TTS reference speaker audio |
| `--tts-language` | ru | TTS target language code |
| `--tts-int8-gpt` | True | Use INT8-quantized GPT |
| `--tts-threads-gpt` | 2 | TTS GPT ONNX threads |
| `--tts-chunk-size` | 20 | TTS AR tokens per vocoder chunk |
| `--audio-queue-max` | 256 | Audio input queue max size |
| `--text-queue-max` | 64 | Text queue max size |
| `--tts-queue-max` | 16 | NMT→TTS text queue max size |
| `--audio-out-queue-max` | 32 | Audio output queue max size |
| `--host` | 0.0.0.0 | Server bind host |
| `--port` | 8765 | Server port |
### Python Client

Captures microphone audio and plays back translated speech:

```shell
pip install sounddevice
python clients/python_client.py --uri ws://localhost:8765 --sample-rate 48000
```
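On the send side, such a client has to convert float32 microphone frames (the format a sounddevice `InputStream` callback delivers) into the little-endian PCM16 byte stream the server expects, one binary WebSocket frame per captured block. A minimal sketch of that conversion; the function name is illustrative, and the shipped client lives in `clients/python_client.py`:

```python
import numpy as np

def float32_to_pcm16(frames: np.ndarray) -> bytes:
    """Convert float32 samples in [-1.0, 1.0] to little-endian PCM16 bytes."""
    clipped = np.clip(frames, -1.0, 1.0)       # guard against overdriven input
    return (clipped * 32767.0).astype("<i2").tobytes()

# One captured block becomes one binary WebSocket frame.
block = np.array([0.0, 0.5, -0.5, 1.0], dtype=np.float32)
payload = float32_to_pcm16(block)              # 4 samples -> 8 bytes
```

Playback is the mirror image: each binary frame from the server is reinterpreted as `<i2` samples at 24 kHz and fed to the audio output.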
### Web Client

Open `clients/web_client.html` in a browser and click "Connect" to start streaming.
## WebSocket Protocol

| Direction | Type | Format | Description |
|---|---|---|---|
| Client → Server | Binary | PCM16 | Raw audio at the declared sample rate |
| Client → Server | Text | JSON | `{"action": "start", "sample_rate": 48000}` |
| Client → Server | Text | JSON | `{"action": "stop"}` |
| Server → Client | Binary | PCM16 | Synthesized audio at 24 kHz |
| Server → Client | Text | JSON | `{"type": "transcript", "text": "..."}` |
| Server → Client | Text | JSON | `{"type": "translation", "text": "..."}` |
| Server → Client | Text | JSON | `{"type": "status", "status": "started"}` |
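A sketch of how a client might frame outgoing control messages and dispatch incoming ones under this protocol. The helper names are illustrative; only the JSON shapes come from the table above:

```python
import json

def start_message(sample_rate: int) -> str:
    """Control message announcing the client's capture sample rate."""
    return json.dumps({"action": "start", "sample_rate": sample_rate})

def stop_message() -> str:
    return json.dumps({"action": "stop"})

def dispatch(message) -> str:
    """Route a server message: binary frames are audio, text frames are JSON."""
    if isinstance(message, (bytes, bytearray)):
        return "audio"                 # PCM16 at 24 kHz -> playback queue
    payload = json.loads(message)
    return payload["type"]             # "transcript" | "translation" | "status"

kind = dispatch('{"type": "transcript", "text": "hello"}')  # "transcript"
```

With a WebSocket library, the flow is: send `start_message(...)` as a text frame, stream binary PCM16 frames, route every received frame through a handler like `dispatch`, and send `stop_message()` to finish.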
## Docker

```shell
docker build -t streaming-translation .
docker run -p 8765:8765 \
  -v /path/to/models:/models \
  streaming-translation \
  --asr-onnx-path /models/asr/ \
  --nmt-gguf-path /models/translategemma-4b-it-q8_0.gguf \
  --tts-model-dir /models/xtts/ \
  --tts-vocab-path /models/xtts/vocab.json \
  --tts-mel-norms-path /models/xtts/mel_stats.npy \
  --tts-ref-audio-path /models/reference.wav
```
## Project Structure

```
streaming_speech_translation/
├── app.py                # Main entry point
├── requirements.txt
├── Dockerfile
├── ARCHITECTURE.md
├── src/
│   ├── asr/
│   │   ├── streaming_asr.py              # StreamingASR wrapper
│   │   ├── pipeline.py                   # ThreadedSpeechTranslator (reference)
│   │   ├── cache_aware_modules.py        # Audio buffer + streaming ASR
│   │   ├── cache_aware_modules_config.py
│   │   ├── modules.py                    # ONNX model loading
│   │   ├── modules_config.py
│   │   ├── onnx_utils.py
│   │   └── utils.py                      # Audio utilities
│   ├── nmt/
│   │   ├── streaming_nmt.py              # StreamingNMT wrapper
│   │   ├── streaming_segmenter.py        # Word-group segmentation
│   │   ├── streaming_translation_merger.py
│   │   └── translator_module.py          # TranslateGemma via llama-cpp
│   ├── tts/
│   │   ├── streaming_tts.py              # StreamingTTS wrapper
│   │   ├── xtts_streaming_pipeline.py    # Full TTS pipeline
│   │   ├── xtts_onnx_orchestrator.py     # GPT-2 AR + vocoder
│   │   ├── xtts_tokenizer.py             # BPE tokenizer
│   │   └── zh_num2words.py               # Chinese text normalization
│   ├── pipeline/
│   │   ├── orchestrator.py               # PipelineOrchestrator
│   │   └── config.py                     # PipelineConfig
│   └── server/
│       └── websocket_server.py           # WebSocket server
└── clients/
    ├── python_client.py  # Python CLI client
    └── web_client.html   # Browser client
```