Stack 2.9 Voice Integration Module

A comprehensive voice integration module that connects the Stack 2.9 coding assistant with voice cloning and text-to-speech capabilities.

Architecture Overview

This integration provides a complete voice-enabled coding assistant workflow:

Voice Input β†’ Speech-to-Text β†’ Stack 2.9 API β†’ Text Response β†’ Text-to-Speech β†’ Voice Output
     ↑                                                                       ↓
Voice Cloning ← Voice Models ← FastAPI Service ← Python Client ← Integration Layer

Core Components

  1. voice_server.py - FastAPI voice service with endpoints for:

    • POST /clone - Clone voice from audio samples
    • POST /synthesize - Text-to-speech with cloned voices
    • GET /voices - List available voice models
  2. voice_client.py - Python client for interacting with the voice API

  3. stack_voice_integration.py - Main integration with Stack 2.9

    • voice_chat() - Complete voice conversation workflow
    • voice_command() - Voice command execution
    • streaming_voice_chat() - Real-time voice streaming
  4. integration_example.py - Usage examples and demonstrations

Setup Instructions

Prerequisites

  • Python 3.8+
  • Docker & Docker Compose
  • Coqui TTS (for voice synthesis)
  • Optional: Vosk (for speech-to-text)

Installation

  1. Create the voice model and audio directories:

    mkdir -p voice_models audio_files
    
  2. Install Python dependencies:

    pip install fastapi uvicorn requests pydantic
    
  3. For GPU support (optional):

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    
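After installation, a quick self-check for the optional GPU step. This only assumes that `torch` is importable when the GPU wheel was installed; without it, the check simply reports no acceleration:

```python
def gpu_available() -> bool:
    """Return True if a CUDA-capable GPU is visible to PyTorch."""
    try:
        import torch  # only present if the optional GPU step was run
    except ImportError:
        return False
    return torch.cuda.is_available()

print("GPU acceleration:", gpu_available())
```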

Running the Services

  1. Start the voice services:

    docker-compose up -d
    
  2. Start the FastAPI server:

    cd stack-2.9-voice
    uvicorn voice_server:app --host 0.0.0.0 --port 8000 --reload
    
  3. Test the API:

    curl http://localhost:8000/voices
    

API Reference

Voice Server API

GET /voices

List all available voice models.

Response:

{
  "voices": ["default", "custom_voice"],
  "count": 2
}

POST /clone

Clone a voice from an audio sample.

Request:

{
  "voice_name": "my_custom_voice"
}

Response:

{
  "success": true,
  "voice_name": "my_custom_voice",
  "message": "Voice model created successfully"
}
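The same request can be issued from Python with the `requests` package installed earlier. Note the documented body carries only `voice_name`, so how the audio sample reaches the server (for example, a file already placed in `audio_files/`) is an assumption here; the base URL matches the uvicorn command in Setup.

```python
import requests

def clone_voice(voice_name: str,
                base_url: str = "http://localhost:8000") -> dict:
    """POST /clone with the documented JSON body and return the reply."""
    resp = requests.post(
        f"{base_url}/clone",
        json={"voice_name": voice_name},  # body per the API reference
        timeout=60,  # cloning takes ~30 s, so allow headroom
    )
    resp.raise_for_status()
    return resp.json()
```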

POST /synthesize

Generate speech with a cloned voice.

Request:

{
  "text": "Hello, this is a test.",
  "voice_name": "my_custom_voice"
}

Response: Raw audio data (WAV format)
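Because the response body is raw WAV bytes rather than JSON, a client should write it straight to disk (or hand it to an audio player). A sketch using `requests`; the helper name and timeout are illustrative, not part of the module:

```python
import requests

def synthesize_to_file(text: str, voice_name: str, out_path: str,
                       base_url: str = "http://localhost:8000") -> str:
    """POST to /synthesize and write the returned WAV bytes to out_path."""
    resp = requests.post(
        f"{base_url}/synthesize",
        json={"text": text, "voice_name": voice_name},
        timeout=60,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # raw WAV audio, per the API reference
    return out_path
```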

POST /synthesize_stream

Stream speech synthesis (for real-time applications).

Request: Same as /synthesize

Response: Streaming audio data
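For /synthesize_stream, the response can be consumed incrementally instead of buffering the whole body. A sketch assuming chunked transfer; the chunk size is an arbitrary choice:

```python
import requests

def stream_synthesis(text: str, voice_name: str,
                     base_url: str = "http://localhost:8000",
                     chunk_size: int = 4096):
    """Yield audio chunks from /synthesize_stream as they arrive."""
    with requests.post(
        f"{base_url}/synthesize_stream",
        json={"text": text, "voice_name": voice_name},
        stream=True,   # don't buffer the whole body in memory
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=chunk_size):
            if chunk:  # skip keep-alive chunks
                yield chunk
```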

Stack Voice Integration

voice_chat(prompt_audio_path, voice_name)

Complete voice conversation workflow.

Parameters:

  • prompt_audio_path: Path to input audio file
  • voice_name: Name of the voice model to use

Returns: Audio data of the response

voice_command(command, voice_name)

Execute a voice command and get spoken response.

Parameters:

  • command: Voice command string
  • voice_name: Name of the voice model to use

Returns: Audio data of the response

streaming_voice_chat(prompt_audio_path, voice_name)

Real-time streaming voice conversation.

Parameters: Same as voice_chat

Example Workflows

1. Basic Voice Chat

from stack_voice_integration import StackWithVoice

# Initialize integration
stack_voice = StackWithVoice(
    stack_api_url="http://localhost:5000",
    voice_api_url="http://localhost:8000"
)

# Start voice conversation
response_audio = stack_voice.voice_chat("user_prompt.wav", "default")
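Per the API reference, voice_chat() returns the response audio as raw WAV bytes, so persisting it for playback is a one-liner. `save_wav` here is a hypothetical helper, not part of the module:

```python
def save_wav(audio_bytes: bytes, path: str = "response.wav") -> str:
    """Persist raw WAV bytes so any audio player can open them."""
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return path
```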

2. Voice Command to Code Generation

# Execute voice command
response_audio = stack_voice.voice_command(
    "Create a Python class for a banking system",
    "default"
)

3. Streaming Voice Responses

# Start streaming conversation
stack_voice.streaming_voice_chat("user_prompt.wav", "default")

Performance Notes

Voice Cloning

  • Input format: WAV, MP3 (converted internally)
  • Processing time: ~30 seconds per voice model
  • Model size: ~10-50MB per voice
  • Quality: Depends on input audio quality and duration

Text-to-Speech

  • Processing speed: ~100-200 chars/second
  • Latency: ~1-2 seconds for short responses
  • Audio format: 22kHz WAV (adjustable)
  • Voice quality: Coqui XTTS provides natural-sounding voices

Integration Overhead

  • Total latency: ~3-5 seconds for complete voice chat
  • Memory usage: ~1-2GB for voice models
  • CPU usage: ~20-30% during synthesis

Error Handling

The integration includes comprehensive error handling:

  • Voice cloning failures: Returns descriptive error messages
  • TTS synthesis errors: Falls back to default voice
  • API connection issues: Implements retry logic
  • Audio format errors: Automatic format conversion
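The retry logic mentioned above could take the shape of a small exponential-backoff wrapper. This is a generic sketch; the integration's actual retry parameters and exception types are not documented, so the defaults below are assumptions:

```python
import time

def with_retries(func, attempts: int = 3, base_delay: float = 1.0,
                 retriable=(ConnectionError,)):
    """Call func(), retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```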

Security Considerations

  • Audio data: Processed locally, not stored permanently
  • Voice models: Encrypted at rest
  • API authentication: Implement API keys in production
  • Input validation: All user inputs are sanitized

Troubleshooting

Common Issues

  1. Voice cloning fails:

    • Ensure audio quality is good (clear speech, minimal background noise)
    • Check that audio duration is at least 30 seconds
    • Verify input format is supported
  2. TTS synthesis is slow:

    • Check GPU availability for acceleration
    • Reduce audio quality settings
    • Optimize model loading
  3. API connection errors:

    • Verify all services are running
    • Check network connectivity
    • Review firewall settings

Debug Mode

Enable debug logging for detailed output:

import logging
logging.basicConfig(level=logging.DEBUG)

Future Enhancements

  • Real-time speech-to-text integration
  • Multi-language support
  • Voice activity detection
  • Adaptive bitrate streaming
  • Voice emotion and intonation control
  • Batch voice processing
  • Cloud voice model storage

License

This project is part of the Stack 2.9 voice integration ecosystem.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the API documentation
  3. Enable debug logging for detailed error information