Streaming Speech Translation Pipeline

Real-time English → Russian speech translation: Audio In → ASR → NMT → TTS → Audio Out

Translates spoken English into spoken Russian with streaming output over WebSocket.

Architecture

Audio Input → ASR (ONNX) → NMT (GGUF) → TTS (ONNX) → Audio Output
  (PCM16)    Conformer RNN-T  TranslateGemma   XTTSv2    (PCM16)
  • ASR: NVIDIA NeMo Conformer RNN-T (cache-aware streaming, ONNX)
  • NMT: TranslateGemma 4B (GGUF Q8_0, llama-cpp-python) with streaming segmentation and translation merging
  • TTS: XTTSv2 with GPT-2 AR model + HiFi-GAN vocoder (ONNX), 24 kHz output
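The three stages run as independent workers joined by bounded queues, which is what the --*-queue-max flags in the CLI options cap. A minimal sketch of that pattern, with placeholder stage functions (run_stage and the lambdas are illustrative, not the repo's API):

```python
import queue
import threading

def run_stage(fn, inbox, outbox):
    """Pull items from inbox, transform them, push downstream; None signals shutdown."""
    while (item := inbox.get()) is not None:
        outbox.put(fn(item))
    outbox.put(None)  # propagate shutdown to the next stage

# Hypothetical stand-ins for the real ASR / NMT / TTS stages.
asr = lambda pcm: f"transcript({pcm})"
nmt = lambda text: f"translation({text})"
tts = lambda text: f"audio({text})"

# Bounded queues give backpressure between stages (cf. --audio-queue-max etc.).
audio_in, text_q, tts_q, audio_out = (queue.Queue(maxsize=8) for _ in range(4))

threads = [
    threading.Thread(target=run_stage, args=(asr, audio_in, text_q)),
    threading.Thread(target=run_stage, args=(nmt, text_q, tts_q)),
    threading.Thread(target=run_stage, args=(tts, tts_q, audio_out)),
]
for t in threads:
    t.start()

audio_in.put("chunk0")
audio_in.put(None)  # end of stream
results = []
while (out := audio_out.get()) is not None:
    results.append(out)
for t in threads:
    t.join()
print(results)  # ['audio(translation(transcript(chunk0)))']
```

Because each queue is bounded, a slow downstream stage (typically TTS) blocks producers instead of letting buffers grow without limit.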

See ARCHITECTURE.md for detailed design documentation.

Requirements

  • Python 3.10+
  • Model files:
    • ASR: NeMo Conformer RNN-T ONNX model directory
    • NMT: TranslateGemma 4B GGUF file
    • TTS: XTTSv2 ONNX model directory, BPE vocab, mel normalization stats, reference audio

Installation

pip install -r requirements.txt

System Dependencies

# Ubuntu/Debian
apt-get install libsndfile1 libportaudio2

Usage

Start the Server

python app.py \
  --asr-onnx-path /path/to/asr_onnx_model/ \
  --nmt-gguf-path /path/to/translategemma-4b-it-q8_0.gguf \
  --tts-model-dir /path/to/xtts_onnx_models/ \
  --tts-vocab-path /path/to/vocab.json \
  --tts-mel-norms-path /path/to/mel_stats.npy \
  --tts-ref-audio-path /path/to/reference_speaker.wav \
  --host 0.0.0.0 \
  --port 8765

CLI Options

Flag                   Default     Description
--asr-onnx-path        (required)  ASR ONNX model directory
--asr-chunk-ms         10          ASR audio chunk duration (ms)
--asr-sample-rate      16000       ASR expected sample rate
--nmt-gguf-path        (required)  NMT GGUF model file
--nmt-n-threads        4           NMT CPU threads
--tts-model-dir        (required)  TTS ONNX model directory
--tts-vocab-path       (required)  TTS BPE vocab.json
--tts-mel-norms-path   (required)  TTS mel_stats.npy
--tts-ref-audio-path   (required)  TTS reference speaker audio
--tts-language         ru          TTS target language code
--tts-int8-gpt         True        Use INT8-quantized GPT
--tts-threads-gpt      2           TTS GPT ONNX threads
--tts-chunk-size       20          TTS AR tokens per vocoder chunk
--audio-queue-max      256         Audio input queue max size
--text-queue-max       64          Text queue max size
--tts-queue-max        16          NMT→TTS text queue max size
--audio-out-queue-max  32          Audio output queue max size
--host                 0.0.0.0     Server bind host
--port                 8765        Server port

Python Client

Captures microphone audio and plays back translated speech:

pip install sounddevice
python clients/python_client.py --uri ws://localhost:8765 --sample-rate 48000

Web Client

Open clients/web_client.html in a browser. Click "Connect" to start streaming.

WebSocket Protocol

Direction  Type    Format  Description
Client →   Binary  PCM16   Raw audio at the declared sample rate
Client →   Text    JSON    {"action": "start", "sample_rate": 48000}
Client →   Text    JSON    {"action": "stop"}
→ Client   Binary  PCM16   Synthesized audio at 24 kHz
→ Client   Text    JSON    {"type": "transcript", "text": "..."}
→ Client   Text    JSON    {"type": "translation", "text": "..."}
→ Client   Text    JSON    {"type": "status", "status": "started"}
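The framing above is plain JSON text frames plus little-endian PCM16 binary frames, so a custom client only needs a few helpers. The functions below are illustrative (not part of the repo) and sketch how to encode outgoing frames and dispatch incoming ones:

```python
import json
import struct

def pcm16_bytes(samples):
    """Pack float samples in [-1, 1] as little-endian PCM16 (the binary frame format)."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return struct.pack(f"<{len(samples)}h", *(int(s * 32767) for s in clamped))

# Control messages are sent as text frames:
start_msg = json.dumps({"action": "start", "sample_rate": 48000})
stop_msg = json.dumps({"action": "stop"})

def handle_server_text(message):
    """Dispatch a text frame from the server by its 'type' field."""
    event = json.loads(message)
    return event["type"], event.get("text") or event.get("status")

print(handle_server_text('{"type": "transcript", "text": "hello"}'))
# ('transcript', 'hello')
```

Binary frames from the server carry 24 kHz PCM16 audio and can be unpacked with the inverse struct format before playback.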

Docker

docker build -t streaming-translation .
docker run -p 8765:8765 \
  -v /path/to/models:/models \
  streaming-translation \
  --asr-onnx-path /models/asr/ \
  --nmt-gguf-path /models/translategemma-4b-it-q8_0.gguf \
  --tts-model-dir /models/xtts/ \
  --tts-vocab-path /models/xtts/vocab.json \
  --tts-mel-norms-path /models/xtts/mel_stats.npy \
  --tts-ref-audio-path /models/reference.wav

Project Structure

streaming_speech_translation/
├── app.py                              # Main entry point
├── requirements.txt
├── Dockerfile
├── ARCHITECTURE.md
├── src/
│   ├── asr/
│   │   ├── streaming_asr.py            # StreamingASR wrapper
│   │   ├── pipeline.py                 # ThreadedSpeechTranslator (reference)
│   │   ├── cache_aware_modules.py      # Audio buffer + streaming ASR
│   │   ├── cache_aware_modules_config.py
│   │   ├── modules.py                  # ONNX model loading
│   │   ├── modules_config.py
│   │   ├── onnx_utils.py
│   │   └── utils.py                    # Audio utilities
│   ├── nmt/
│   │   ├── streaming_nmt.py            # StreamingNMT wrapper
│   │   ├── streaming_segmenter.py      # Word-group segmentation
│   │   ├── streaming_translation_merger.py
│   │   └── translator_module.py        # TranslateGemma via llama-cpp
│   ├── tts/
│   │   ├── streaming_tts.py            # StreamingTTS wrapper
│   │   ├── xtts_streaming_pipeline.py  # Full TTS pipeline
│   │   ├── xtts_onnx_orchestrator.py   # GPT-2 AR + vocoder
│   │   ├── xtts_tokenizer.py           # BPE tokenizer
│   │   └── zh_num2words.py             # Chinese text normalization
│   ├── pipeline/
│   │   ├── orchestrator.py             # PipelineOrchestrator
│   │   └── config.py                   # PipelineConfig
│   └── server/
│       └── websocket_server.py         # WebSocket server
└── clients/
    ├── python_client.py                # Python CLI client
    └── web_client.html                 # Browser client