Stack 2.9 Voice Integration Module

A comprehensive voice integration module that connects the Stack 2.9 coding assistant with voice cloning and text-to-speech capabilities.

Architecture Overview

This integration provides a complete voice-enabled coding assistant workflow:

Voice Input β†’ Speech-to-Text β†’ Stack 2.9 API β†’ Text Response β†’ Text-to-Speech β†’ Voice Output
     ↑                                                                       ↓
Voice Cloning ← Voice Models ← FastAPI Service ← Python Client ← Integration Layer

Core Components

  1. voice_server.py - FastAPI voice service with endpoints for:

    • POST /clone - Clone voice from audio samples
    • POST /synthesize - Text-to-speech with cloned voices
    • GET /voices - List available voice models
  2. voice_client.py - Python client for interacting with the voice API

  3. stack_voice_integration.py - Main integration with Stack 2.9

    • voice_chat() - Complete voice conversation workflow
    • voice_command() - Voice command execution
    • streaming_voice_chat() - Real-time voice streaming
  4. integration_example.py - Usage examples and demonstrations

Setup Instructions

Prerequisites

  • Python 3.8+
  • Docker & Docker Compose
  • Coqui TTS (for voice synthesis)
  • Optional: Vosk (for speech-to-text)

Installation

  1. Create the voice model and audio directories:

    mkdir -p voice_models audio_files
    
  2. Install Python dependencies:

    pip install fastapi uvicorn requests pydantic
    
  3. For GPU support (optional):

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    
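After installation, a quick self-check for the optional GPU step. This only assumes that `torch` is importable when the GPU wheel was installed; without it, the check simply reports no acceleration:

```python
def gpu_available() -> bool:
    """Return True if a CUDA-capable GPU is visible to PyTorch."""
    try:
        import torch  # only present if the optional GPU step was run
    except ImportError:
        return False
    return torch.cuda.is_available()

print("GPU acceleration:", gpu_available())
```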

Running the Services

  1. Start the voice services:

    docker-compose up -d
    
  2. Start the FastAPI server:

    cd stack-2.9-voice
    uvicorn voice_server:app --host 0.0.0.0 --port 8000 --reload
    
  3. Test the API:

    curl http://localhost:8000/voices
    

API Reference

Voice Server API

GET /voices

List all available voice models.

Response:

{
  "voices": ["default", "custom_voice"],
  "count": 2
}

POST /clone

Clone a voice from an audio sample.

Request:

{
  "voice_name": "my_custom_voice"
}

Response:

{
  "success": true,
  "voice_name": "my_custom_voice",
  "message": "Voice model created successfully"
}
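The same request can be issued from Python with the `requests` package installed earlier. Note the documented body carries only `voice_name`, so how the audio sample reaches the server (for example, a file already placed in `audio_files/`) is an assumption here; the base URL matches the uvicorn command in Setup.

```python
import requests

def clone_voice(voice_name: str,
                base_url: str = "http://localhost:8000") -> dict:
    """POST /clone with the documented JSON body and return the reply."""
    resp = requests.post(
        f"{base_url}/clone",
        json={"voice_name": voice_name},  # body per the API reference
        timeout=60,  # cloning takes ~30 s, so allow headroom
    )
    resp.raise_for_status()
    return resp.json()
```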

POST /synthesize

Generate speech with a cloned voice.

Request:

{
  "text": "Hello, this is a test.",
  "voice_name": "my_custom_voice"
}

Response: Raw audio data (WAV format)
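Because the response body is raw WAV bytes rather than JSON, a client should write it straight to disk (or hand it to an audio player). A sketch using `requests`; the helper name and timeout are illustrative, not part of the module:

```python
import requests

def synthesize_to_file(text: str, voice_name: str, out_path: str,
                       base_url: str = "http://localhost:8000") -> str:
    """POST to /synthesize and write the returned WAV bytes to out_path."""
    resp = requests.post(
        f"{base_url}/synthesize",
        json={"text": text, "voice_name": voice_name},
        timeout=60,
    )
    resp.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(resp.content)  # raw WAV audio, per the API reference
    return out_path
```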

POST /synthesize_stream

Stream speech synthesis (for real-time applications).

Request: Same as /synthesize

Response: Streaming audio data
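For /synthesize_stream, the response can be consumed incrementally instead of buffering the whole body. A sketch assuming chunked transfer; the chunk size is an arbitrary choice:

```python
import requests

def stream_synthesis(text: str, voice_name: str,
                     base_url: str = "http://localhost:8000",
                     chunk_size: int = 4096):
    """Yield audio chunks from /synthesize_stream as they arrive."""
    with requests.post(
        f"{base_url}/synthesize_stream",
        json={"text": text, "voice_name": voice_name},
        stream=True,   # don't buffer the whole body in memory
        timeout=60,
    ) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=chunk_size):
            if chunk:  # skip keep-alive chunks
                yield chunk
```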

Stack Voice Integration

voice_chat(prompt_audio_path, voice_name)

Complete voice conversation workflow.

Parameters:

  • prompt_audio_path: Path to input audio file
  • voice_name: Name of the voice model to use

Returns: Audio data of the response

voice_command(command, voice_name)

Execute a voice command and get spoken response.

Parameters:

  • command: Voice command string
  • voice_name: Name of the voice model to use

Returns: Audio data of the response

streaming_voice_chat(prompt_audio_path, voice_name)

Real-time streaming voice conversation.

Parameters: Same as voice_chat

Example Workflows

1. Basic Voice Chat

from stack_voice_integration import StackWithVoice

# Initialize integration
stack_voice = StackWithVoice(
    stack_api_url="http://localhost:5000",
    voice_api_url="http://localhost:8000"
)

# Start voice conversation
response_audio = stack_voice.voice_chat("user_prompt.wav", "default")
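Per the API reference, voice_chat() returns the response audio as raw WAV bytes, so persisting it for playback is a one-liner. `save_wav` here is a hypothetical helper, not part of the module:

```python
def save_wav(audio_bytes: bytes, path: str = "response.wav") -> str:
    """Persist raw WAV bytes so any audio player can open them."""
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return path
```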

2. Voice Command to Code Generation

# Execute voice command
response_audio = stack_voice.voice_command(
    "Create a Python class for a banking system",
    "default"
)

3. Streaming Voice Responses

# Start streaming conversation
stack_voice.streaming_voice_chat("user_prompt.wav", "default")

Performance Notes

Voice Cloning

  • Input format: WAV, MP3 (converted internally)
  • Processing time: ~30 seconds per voice model
  • Model size: ~10-50MB per voice
  • Quality: Depends on input audio quality and duration

Text-to-Speech

  • Processing speed: ~100-200 chars/second
  • Latency: ~1-2 seconds for short responses
  • Audio format: 22kHz WAV (adjustable)
  • Voice quality: Coqui XTTS provides natural-sounding voices

Integration Overhead

  • Total latency: ~3-5 seconds for complete voice chat
  • Memory usage: ~1-2GB for voice models
  • CPU usage: ~20-30% during synthesis

Error Handling

The integration includes comprehensive error handling:

  • Voice cloning failures: Returns descriptive error messages
  • TTS synthesis errors: Falls back to default voice
  • API connection issues: Implements retry logic
  • Audio format errors: Automatic format conversion
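The retry logic mentioned above could take the shape of a small exponential-backoff wrapper. This is a generic sketch; the integration's actual retry parameters and exception types are not documented, so the defaults below are assumptions:

```python
import time

def with_retries(func, attempts: int = 3, base_delay: float = 1.0,
                 retriable=(ConnectionError,)):
    """Call func(), retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except retriable:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```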

Security Considerations

  • Audio data: Processed locally, not stored permanently
  • Voice models: Encrypted at rest
  • API authentication: Implement API keys in production
  • Input validation: All user inputs are sanitized

Troubleshooting

Common Issues

  1. Voice cloning fails:

    • Ensure audio quality is good (clear speech, minimal background noise)
    • Check that audio duration is at least 30 seconds
    • Verify input format is supported
  2. TTS synthesis is slow:

    • Check GPU availability for acceleration
    • Reduce audio quality settings
    • Optimize model loading
  3. API connection errors:

    • Verify all services are running
    • Check network connectivity
    • Review firewall settings

Debug Mode

Enable debug logging for detailed output:

import logging
logging.basicConfig(level=logging.DEBUG)

Future Enhancements

  • Real-time speech-to-text integration
  • Multi-language support
  • Voice activity detection
  • Adaptive bitrate streaming
  • Voice emotion and intonation control
  • Batch voice processing
  • Cloud voice model storage

License

This project is part of the Stack 2.9 voice integration ecosystem.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the API documentation
  3. Enable debug logging for detailed error information