# Deployment Troubleshooting Guide

## Quick Diagnostic

Run the health check first:

```bash
curl http://localhost:8000/health
```

Or use Python:

```bash
python3 -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/health').read())"
```

Check the logs:

```bash
docker-compose logs -f vllm
# or
tail -f logs/vllm.log
```

---

## Common Issues and Solutions

### 1. Docker/Compose Issues

#### Problem: `docker: command not found`

**Error:** Docker is not installed or not in PATH.

**Solution:**

```bash
# Install Docker (Ubuntu/Debian)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER  # Log out and back in

# Install Docker Compose
sudo apt-get install docker-compose-plugin
# or download a binary: https://github.com/docker/compose/releases
```

#### Problem: `Cannot connect to the Docker daemon`

**Error:** Permission denied or socket not found.

**Solution:**

```bash
# Start the Docker service
sudo systemctl start docker
sudo systemctl enable docker

# Verify permissions
docker info
```

#### Problem: `nvidia: driver not installed` or GPU not detected

**Error:** Docker doesn't see the NVIDIA GPU.

**Solution:**

```bash
# Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Verify
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
```

---

### 2. vLLM Service Issues

#### Problem: GPU Out of Memory (OOM)

**Error in logs:** `CUDA out of memory` or `CUDA error: out of memory`

**Solution:**
1. **Reduce model memory usage** via environment variables:

   ```bash
   export GPU_MEMORY_UTILIZATION=0.7   # Lower from 0.9
   export MAX_MODEL_LEN=8192           # Reduce from 131072
   export BLOCK_SIZE=16                # Smaller blocks
   ```

2. **Use a quantized model** (recommended):
   - Convert the model to AWQ or GGUF format
   - Set `QUANTIZATION=awq` in the environment

3. **Use a smaller model**: switch from Llama-3.1-8B to a 7B or smaller model

4. **Reduce the batch size**:

   ```bash
   export MAX_BATCH_SIZE=4
   ```

5. **Ensure no other processes** are using the GPU:

   ```bash
   nvidia-smi  # Check for other processes
   ```

#### Problem: `Model not found`

**Error:** Model fails to load, `FileNotFoundError`, or the service stays in the loading state.

**Solution:**

1. **Check the model path**:

   ```bash
   # For a local model:
   ls -la models/
   # Should contain config.json, pytorch_model.bin, etc.

   # For a HuggingFace model:
   # Set MODEL_NAME to the HF name, e.g., meta-llama/Llama-3.1-8B-Instruct
   ```

2. **Download the model manually** if the automatic download fails:

   ```bash
   # Install huggingface-cli
   pip install huggingface-hub

   # Download (requires authentication for gated models)
   huggingface-cli login  # if needed
   huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir models
   ```

3. **Check disk space**:

   ```bash
   df -h
   # Need ~16GB for an 8B model (32GB for the original, ~8GB quantized)
   ```

4. **Use a pre-downloaded model**:
   - Upload the model to the `models/` directory before starting
   - Mount an external volume containing the model

#### Problem: `Health check timeout` or `503 Service Unavailable`

**Cause:** Model still loading, or failed to start.

**Diagnosis:**

```bash
docker-compose logs vllm
# Look for "Model loaded successfully" or error messages
```

**Solution:**

- Wait longer (the first load can take 5-15 minutes)
- Check the logs for specific errors (OOM, missing files)
- Increase the healthcheck `start_period`:

  ```yaml
  healthcheck:
    start_period: 300s  # Increase from 120s
  ```

#### Problem: CORS or network errors when calling the API

**Symptoms:** Connection refused, network timeout.
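Before applying fixes, a quick connectivity probe helps distinguish "service down" from "firewall or port mapping" problems. A minimal sketch using only Python's standard library (the host and port are assumptions matching the defaults in this guide):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Usage:
# port_open("localhost", 8000)  # True -> something is listening; look at CORS/firewall next
```

If this returns `False`, the container is not reachable at all and the steps below (container status, port mapping, firewall) are the right place to look.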
**Solution:**

```bash
# Check if the container is running
docker-compose ps

# Check the port mapping
docker-compose port vllm 8000

# Test from inside the container
docker-compose exec vllm curl http://localhost:8000/health

# Check the firewall
sudo ufw status
sudo ufw allow 8000
```

#### Problem: `Redis connection failed`

**Error:** `Could not connect to Redis`

**Solution:**

- Redis is optional (used for caching); vLLM will continue without it.
- If you do want Redis:

  ```bash
  docker-compose ps redis   # Check if it is running
  docker-compose logs redis
  ```

---

### 3. Docker Compose Issues

#### Problem: `Port already in use`

**Error:** `Bind for 0.0.0.0:8000 failed: port is already allocated`

**Solution:**

```bash
# Find the process using the port
lsof -i :8000
# or
netstat -tulpn | grep :8000

# Kill the process, or change the port in docker-compose.yml:
# ports:
#   - "8001:8000"  # Map host 8001 to container 8000
```

#### Problem: `Volume mount permission denied`

**Error:** Cannot mount `./models:/models`

**Solution:**

```bash
# Create the directories with proper permissions
mkdir -p models logs
sudo chown -R $(id -u):$(id -g) models logs
# Or adjust ownership to match the UID the container runs as
```

#### Problem: `docker-compose: command not found`

**Solution:**

```bash
# Docker Compose v2 (plugin, included with recent Docker)
sudo apt-get install docker-compose-plugin

# Or the standalone docker-compose binary
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
```

---
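When the port-conflict fix above calls for remapping the host port, you can let the OS pick an unused one instead of guessing. A minimal sketch, not part of the deployment scripts (the function name is mine):

```python
import socket

def free_host_port() -> int:
    """Ask the OS for an unused TCP port, e.g. for the host side of "8001:8000"."""
    with socket.socket() as s:
        s.bind(("", 0))          # port 0 = let the kernel choose
        return s.getsockname()[1]

# print(free_host_port())  # use this number as the host side of the port mapping
```

Note the port is only guaranteed free at the moment of the call; bind it in `docker-compose.yml` promptly.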
### 4. Cloud Deployment Issues

#### RunPod Specific

**Problem: `runpodctl: command not found`**

```bash
# Install
curl -L https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-linux-amd64 -o runpodctl
sudo install runpodctl /usr/local/bin/
runpodctl config  # Set API key
```

**Problem: `Template not found` or `pod creation failed`**

- Ensure you have sufficient quota/balance
- Check GPU availability in your region
- Verify the template name (case-sensitive)

**Problem: `SCP/SSH connection failed`**

- The pod may still be starting; wait 2-3 minutes
- Check the pod status: `runpodctl get pod`
- Verify the pod is in the `RUNNING` state

**Problem: `Insufficient disk space` on the pod**

- Increase the disk size in the script (`DISK_SIZE=100` or higher)
- Upload the model separately to `/workspace/models` before starting

#### Vast.ai Specific

**Problem: `vastai: command not found`**

```bash
pip install vastai
# or download from: https://vast.ai/docs/cli
```

**Problem: `No suitable instance found`**

- Relax the search criteria (lower `VAST_GPU_RAM`)
- Increase `VAST_SEARCH_LIMIT`
- Check the marketplace manually: `vastai search offers "cuda>=11.8"`

**Problem: `SSH connection refused`**

- The instance may still be provisioning
- Check `vastai show instance`
- Ensure port forwarding is set up correctly

**Problem: `Instance died or unresponsive`**

- Check whether your balance is depleted
- The instance may have been evicted (low priority)
- Use the `--priority` flag or choose higher-cost instances

---

## Performance Tuning

### Reduce Latency

```bash
export MAX_BATCH_SIZE=4           # Smaller batches for lower latency
export MAX_MODEL_LEN=4096         # Shorter context window
export GPU_MEMORY_UTILIZATION=0.8
```

### Increase Throughput

```bash
export MAX_BATCH_SIZE=32          # Larger batches
export MAX_MODEL_LEN=16384        # Longer context capability
export GPU_MEMORY_UTILIZATION=0.95
```

### Multi-GPU Setup

```bash
# Automatically detected.
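# A suggested check (not part of the scripts): count the GPUs visible to the driver
# before setting the tensor parallel size:
# nvidia-smi --query-gpu=name --format=csv,noheader | wc -l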
# Ensure the tensor parallel size matches the GPU count:
# export TENSOR_PARALLEL_SIZE=2  # For 2 GPUs (usually auto-detected)
```

---

## Monitoring

### Health Endpoint

```bash
curl http://localhost:8000/health | jq
# Returns: {"status":"healthy","model":{...},"timestamp":...}
```

### Readiness Endpoint (K8s readiness probe)

```bash
curl http://localhost:8000/ready
# Returns: {"status":"ready"}
```

### Prometheus Metrics

```bash
curl http://localhost:9090/metrics
# Look for: vllm_requests_total, vllm_request_latency_seconds
```

### Container Logs

```bash
# All logs
docker-compose logs -f vllm

# Last 100 lines
docker-compose logs --tail=100 vllm

# Search for errors
docker-compose logs vllm | grep -i error
```

---

## Model Compatibility

### Supported Formats

- **HuggingFace (default)**: `MODEL_FORMAT=hf`
- **Local directory**: mount the model folder to `/models`
- **AWQ quantized**: set `QUANTIZATION=awq` and use an AWQ model

### Gated Models (Llama 3.1, etc.)

1. Request access on HuggingFace
2. Get your token: https://huggingface.co/settings/tokens
3. Authenticate:

   ```bash
   huggingface-cli login
   # Paste token
   ```

### Unsupported Models

If vLLM doesn't support your model architecture:

- Use `trust_remote_code=True` (already set)
- Convert the model to a supported format
- Check the vLLM supported models list: https://docs.vllm.ai/

---

## Debug Mode

Enable verbose logging:

```bash
export LOG_LEVEL=DEBUG

# Restart services
docker-compose down && docker-compose up -d
```

---

## Getting Help

1. Check this guide for common symptoms
2. Review the logs: `docker-compose logs vllm`
3. Search issues: https://github.com/vllm-project/vllm/issues
4. Community: https://discord.gg/vllm

---

## Quick Reference Commands

```bash
# Start deployment
cd stack-2.9-deploy
./local_deploy.sh

# Stop deployment
docker-compose down

# View logs
docker-compose logs -f vllm

# Restart a single service
docker-compose restart vllm

# Check service status
docker-compose ps

# Access a container shell
docker-compose exec vllm bash

# Clean everything (WARNING: deletes data!)
docker-compose down -v
rm -rf models logs

# Rebuild the image (after Dockerfile changes)
docker-compose build --no-cache vllm
docker-compose up -d
```
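After `./local_deploy.sh`, the first model load can take several minutes (see the health check timeout issue above), so scripting against the API needs a wait loop. A minimal sketch using Python's standard library; the URL, timeout, and the injectable `fetch` hook are illustrative assumptions, not part of the deployment scripts:

```python
import time
import urllib.error
import urllib.request

def wait_for_health(url="http://localhost:8000/health",
                    timeout_s=900, interval_s=10, fetch=None):
    """Poll `url` until it answers, or raise TimeoutError.

    `fetch` can be injected for testing; by default it performs an HTTP GET.
    """
    if fetch is None:
        fetch = lambda: urllib.request.urlopen(url, timeout=5).status == 200
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch():
                return True
        except (urllib.error.URLError, OSError):
            pass  # service not up yet; retry after the interval
        time.sleep(interval_s)
    raise TimeoutError(f"{url} not healthy after {timeout_s}s")

# Usage, right after starting the stack:
# wait_for_health()
```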