| # Deployment Troubleshooting Guide |
|
|
| ## Quick Diagnostic |
|
|
| Run the health check first: |
| ```bash |
| curl http://localhost:8000/health |
| ``` |
|
|
| Or use Python: |
| ```bash |
| python3 -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/health', timeout=5).read().decode())" |
| ``` |
|
|
| Check logs: |
| ```bash |
| docker-compose logs -f vllm |
| # or |
| tail -f logs/vllm.log |
| ``` |
|
|
| --- |
|
|
| ## Common Issues and Solutions |
|
|
| ### 1. Docker/Compose Issues |
|
|
| #### Problem: `docker: command not found` |
| **Error:** Docker is not installed or not in PATH. |
|
|
| **Solution:** |
| ```bash |
| # Install Docker (Ubuntu/Debian) |
| curl -fsSL https://get.docker.com -o get-docker.sh |
| sudo sh get-docker.sh |
| sudo usermod -aG docker $USER |
| # Log out and back in (or run: newgrp docker) so the group change takes effect |
| |
| # Install the Docker Compose v2 plugin (invoked as "docker compose", with a space) |
| sudo apt-get install docker-compose-plugin |
| # or download binary: https://github.com/docker/compose/releases |
| ``` |
|
|
| #### Problem: `Cannot connect to the Docker daemon` |
| **Error:** Permission denied or socket not found. |
|
|
| **Solution:** |
| ```bash |
| # Start Docker service |
| sudo systemctl start docker |
| sudo systemctl enable docker |
| |
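| # Verify access; on "permission denied", add your user to the docker group |
| # (sudo usermod -aG docker $USER) and log in again |
| docker info |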
| ``` |
|
|
| #### Problem: `nvidia: driver not installed` or GPU not detected |
| **Error:** Docker cannot see the NVIDIA GPU. |
|
|
| **Solution:** |
| ```bash |
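| # Install NVIDIA Container Toolkit (nvidia-docker2 is the legacy package name; |
| # newer NVIDIA instructions install nvidia-container-toolkit instead) |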
| distribution=$(. /etc/os-release;echo $ID$VERSION_ID) |
| curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - |
| curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list |
| sudo apt-get update && sudo apt-get install -y nvidia-docker2 |
| sudo systemctl restart docker |
| |
| # Verify |
| docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi |
| ``` |
|
|
| --- |
|
|
| ### 2. vLLM Service Issues |
|
|
| #### Problem: `GPU Out of Memory (OOM)` |
| **Error in logs:** `CUDA out of memory` or `CUDA error: out of memory` |
|
|
| **Solution:** |
|
|
| 1. **Reduce model memory usage** via environment variables: |
| ```bash |
| export GPU_MEMORY_UTILIZATION=0.7 # Lower from 0.9 |
| export MAX_MODEL_LEN=8192 # Reduce from 131072 |
| export BLOCK_SIZE=16 # Smaller blocks |
| ``` |
|
|
| 2. **Use quantized model** (recommended): |
| - Convert model to AWQ or GGUF format |
| - Set `QUANTIZATION=awq` in environment |
|
|
| 3. **Use a smaller model**: Switch from Llama-3.1-8B to a model with fewer parameters |
|
|
| 4. **Reduce batch size**: |
| ```bash |
| export MAX_BATCH_SIZE=4 |
| ``` |
|
|
| 5. **Ensure no other processes** are using GPU: |
| ```bash |
| nvidia-smi # Check for other processes |
| ``` |
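
To put a number on step 5, the `nvidia-smi` query output can be reduced to a per-GPU usage percentage. This is a minimal sketch: `gpu_mem_percent` is a hypothetical helper name, not part of this deployment, and it assumes the `used, total` CSV format that the `--query-gpu` flags in the comment produce.

```bash
# Print the percentage of memory in use for each GPU, one line per GPU.
# Reads "used, total" pairs (MiB) on stdin, as produced by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
gpu_mem_percent() {
    # Split on the comma separator and skip malformed or zero-total lines.
    awk -F', *' 'NF == 2 && $2 > 0 { printf "%d\n", ($1 * 100) / $2 }'
}

# Usage (on a host where nvidia-smi is installed):
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | gpu_mem_percent
```

If a GPU already shows high usage before vLLM starts, another process is holding memory, and lowering `GPU_MEMORY_UTILIZATION` alone may not be enough.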
|
|
| #### Problem: `Model not found` |
| **Error:** Model fails to load, `FileNotFoundError`, or stays in loading state. |
|
|
| **Solution:** |
|
|
| 1. **Check model path**: |
| ```bash |
| # For local model: |
| ls -la models/ |
| # Should contain config.json, tokenizer files, and weights (*.safetensors or pytorch_model.bin) |
| |
| # For HuggingFace model: |
| # Set MODEL_NAME to HF name, e.g., meta-llama/Llama-3.1-8B-Instruct |
| ``` |
|
|
| 2. **Download model manually** if automatic download fails: |
| ```bash |
| # Install huggingface-cli |
| pip install huggingface-hub |
| |
| # Download (requires authentication for gated models) |
| huggingface-cli login # if needed |
| huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir models |
| ``` |
|
|
| 3. **Check disk space**: |
| ```bash |
| df -h |
| # Need ~16GB for an 8B model in 16-bit precision (~32GB in FP32, ~8GB for an 8-bit quantized copy) |
| ``` |
|
|
| 4. **Use pre-downloaded model**: |
| - Upload model to the `models/` directory before starting |
| - Mount external volume with model |
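
The disk space check in step 3 can be scripted so a deploy stops early instead of failing partway through a download. The sketch below is illustrative only: `has_free_space` is a hypothetical helper, and the 16 GiB threshold simply mirrors the 16-bit estimate above.

```bash
# Succeed if the filesystem holding DIR has at least MIN_GIB gibibytes free.
has_free_space() {
    local dir=$1 min_gib=$2 avail_kb
    # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is "Available".
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gib * 1024 * 1024)) ]
}

if has_free_space . 16; then
    echo "Enough free space for a 16-bit 8B model"
else
    echo "Less than 16 GiB free: clear space or use a quantized model"
fi
```

Point the first argument at the directory that will hold the weights (for example `models`) once it exists, since it may sit on a different filesystem than the working directory.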
|
|
| #### Problem: `Health check timeout` or `503 Service Unavailable` |
| **Cause:** Model still loading, or failed to start. |
|
|
| **Diagnosis:** |
| ```bash |
| docker-compose logs vllm |
| # Look for "Model loaded successfully" or error messages |
| ``` |
|
|
| **Solution:** |
| - Wait longer (first load can take 5-15 minutes) |
| - Check logs for specific errors (OOM, missing files) |
| - Increase healthcheck start_period: |
| ```yaml |
| healthcheck: |
| start_period: 300s # Increase from 120s |
| ``` |
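
Rather than re-running `curl` by hand while the model loads, a polling loop can wait for the health endpoint. This is a sketch under assumptions: `wait_for_health` is a hypothetical helper, and the 900-second default only reflects the 5-15 minute load time mentioned above.

```bash
# Poll URL until it answers with a successful HTTP status or TIMEOUT seconds elapse.
wait_for_health() {
    local url=$1 timeout=${2:-900} interval=${3:-10} waited=0
    # curl -f turns HTTP errors such as 503 into a non-zero exit status.
    until curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${waited}s waiting for $url" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "Service is healthy after ~${waited}s"
}

# Usage:
#   wait_for_health http://localhost:8000/health 900 10
```

If the loop times out, the logs are the next place to look, since a container that crashed during load never becomes healthy no matter how long you wait.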
| |
| #### Problem: `CORS or network errors` when calling API |
| **Symptoms:** Connection refused, network timeout. |
| |
| **Solution:** |
| ```bash |
| # Check if container is running |
| docker-compose ps |
|
|
| # Check port mapping |
| docker-compose port vllm 8000 |
|
|
| # Test from inside container |
| docker-compose exec vllm curl http://localhost:8000/health |
|
|
| # Check firewall |
| sudo ufw status |
| sudo ufw allow 8000 |
| ``` |
| |
| #### Problem: `Redis connection failed` |
| **Error:** `Could not connect to Redis` |
| |
| **Solution:** |
| - Redis is optional (caching). vLLM will continue without it. |
| - If you want Redis: |
| ```bash |
| docker-compose ps redis # Check if running |
| docker-compose logs redis |
| ``` |
| |
| --- |
| |
| ### 3. Docker Compose Issues |
| |
| #### Problem: `Port already in use` |
| **Error:** `Bind for 0.0.0.0:8000 failed: port is already allocated` |
| |
| **Solution:** |
| ```bash |
| # Find process using port |
| lsof -i :8000 |
| # or |
| netstat -tulpn | grep :8000 |
|
|
| # Kill process or change port in docker-compose.yml: |
| # ports: |
| # - "8001:8000" # Map host 8001 to container 8000 |
| ``` |
| |
| #### Problem: `Volume mount permission denied` |
| **Error:** Cannot mount `./models:/models` |
| |
| **Solution:** |
| ```bash |
| # Create directories with proper permissions |
| mkdir -p models logs |
| sudo chown -R $(id -u):$(id -g) models logs |
| # On SELinux hosts, relabel the mount instead by appending :z (e.g. ./models:/models:z) |
| ``` |
| |
| #### Problem: `docker-compose: command not found` |
| **Solution:** |
| ```bash |
| # Docker Compose v2 plugin (invoked as "docker compose", with a space) |
| sudo apt-get install docker-compose-plugin |
|
|
| # Or the standalone binary, which keeps the "docker-compose" command name |
| sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m)" -o /usr/local/bin/docker-compose |
| sudo chmod +x /usr/local/bin/docker-compose |
| ``` |
| |
| --- |
| |
| ### 4. Cloud Deployment Issues |
| |
| #### RunPod Specific |
| |
| **Problem: `runpodctl: command not found`** |
| ```bash |
| # Install |
| curl -L https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-linux-amd64 -o runpodctl |
| sudo install runpodctl /usr/local/bin/ |
| runpodctl config # Set API key |
| ``` |
| |
| **Problem: `Template not found` or `pod creation failed`** |
| - Ensure you have sufficient quota/balance |
| - Check GPU availability in your region |
| - Verify template name (case-sensitive) |
| |
| **Problem: `SCP/SSH connection failed`** |
| - Pod may still be starting; wait 2-3 minutes |
| - Check pod status: `runpodctl get pod <id>` |
| - Verify pod is in `RUNNING` state |
| |
| **Problem: `Insufficient disk space` on pod** |
| - Increase disk size in script (`DISK_SIZE=100` or higher) |
| - Upload model separately to `/workspace/models` before starting |
| |
| #### Vast.ai Specific |
| |
| **Problem: `vastai: command not found`** |
| ```bash |
| pip install vastai |
| # or download from: https://vast.ai/docs/cli |
| ``` |
| |
| **Problem: `No suitable instance found`** |
| - Relax search criteria (lower `VAST_GPU_RAM`) |
| - Increase `VAST_SEARCH_LIMIT` |
| - Check marketplace manually: `vastai search offers 'cuda_vers >= 11.8'` |
| |
| **Problem: `SSH connection refused`** |
| - Instance may still be provisioning |
| - Check `vastai show instance <id>` |
| - Ensure port forwarding is set up correctly |
| |
| **Problem: `Instance died or unresponsive`** |
| - Check if balance depleted |
| - Instance may have been evicted (low priority) |
| - Use `--priority` flag or choose higher-cost instances |
| |
| --- |
| |
| ## Performance Tuning |
| |
| ### Reduce Latency |
| ```bash |
| export MAX_BATCH_SIZE=4 # Smaller batches for lower latency |
| export MAX_MODEL_LEN=4096 # Shorter context window |
| export GPU_MEMORY_UTILIZATION=0.8 |
| ``` |
| |
| ### Increase Throughput |
| ```bash |
| export MAX_BATCH_SIZE=32 # Larger batches |
| export MAX_MODEL_LEN=16384 # Longer context capability |
| export GPU_MEMORY_UTILIZATION=0.95 |
| ``` |
| |
| ### Multi-GPU Setup |
| ```bash |
| # Automatically detected. Ensure tensor parallel size matches GPU count: |
| # export TENSOR_PARALLEL_SIZE=2 # For 2 GPUs (usually auto-detected) |
| ``` |
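
If auto-detection does not pick up every GPU, the count can be derived from `nvidia-smi -L` and exported explicitly. This is a sketch, not the deployment's own logic: `count_gpus` is a hypothetical helper, and it assumes the usual one `GPU N: ...` line per device. Note that vLLM generally expects the model's attention head count to be divisible by the tensor parallel size, so an odd GPU count may not be usable as-is.

```bash
# Count GPUs from `nvidia-smi -L` output on stdin (one "GPU N: ..." line per device).
count_gpus() {
    # grep -c prints 0 but exits non-zero when nothing matches, so mask the status.
    grep -c '^GPU [0-9]' || true
}

# Usage (on a host where nvidia-smi is installed):
#   gpus=$(nvidia-smi -L | count_gpus)
#   if [ "$gpus" -gt 1 ]; then export TENSOR_PARALLEL_SIZE=$gpus; fi
```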
| |
| --- |
| |
| ## Monitoring |
| |
| ### Health Endpoint |
| ```bash |
| curl http://localhost:8000/health | jq |
| # Returns: {"status":"healthy","model":{...},"timestamp":...} |
| ``` |
| |
| ### Readiness Endpoint (K8s readiness probe) |
| ```bash |
| curl http://localhost:8000/ready |
| # Returns: {"status":"ready"} |
| ``` |
| |
| ### Prometheus Metrics |
| ```bash |
| curl http://localhost:9090/metrics |
| # Look for: vllm_requests_total, vllm_request_latency_seconds |
| ``` |
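
Counters are usually exported once per label combination, so a total needs to be summed across lines. The helper below is a sketch: `sum_metric` is a hypothetical name, and it assumes the endpoint serves standard Prometheus text format without trailing timestamps, using the metric names listed above.

```bash
# Sum every sample of the metric named $1 in Prometheus text format read from stdin.
sum_metric() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP and TYPE comment lines
        {
            metric = $1
            sub(/\{.*/, "", metric)        # drop the label set, if any
            if (metric == name) total += $NF
        }
        END { printf "%d\n", total }
    '
}

# Usage:
#   curl -s http://localhost:9090/metrics | sum_metric vllm_requests_total
```

Running it twice a few seconds apart and comparing the totals gives a rough request rate without a full Prometheus server.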
| |
| ### Container Logs |
| ```bash |
| # All logs |
| docker-compose logs -f vllm |
| |
| # Last 100 lines |
| docker-compose logs --tail=100 vllm |
| |
| # Search for errors |
| docker-compose logs vllm | grep -i error |
| ``` |
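
The error strings quoted throughout this guide can be matched automatically to point at the right section. This is a sketch only: `diagnose_logs` is a hypothetical helper, and the patterns are the messages cited in this guide, not an exhaustive list of what vLLM or Docker can emit.

```bash
# Read log text on stdin and print one line per known problem found in it.
diagnose_logs() {
    local logs
    logs=$(cat)
    case $logs in *"out of memory"*)              echo "GPU out of memory" ;; esac
    case $logs in *"FileNotFoundError"*)          echo "Model files missing" ;; esac
    case $logs in *"Could not connect to Redis"*) echo "Redis unreachable (optional)" ;; esac
    case $logs in *"port is already allocated"*)  echo "Port already in use" ;; esac
    return 0
}

# Usage:
#   docker-compose logs vllm 2>&1 | diagnose_logs
```

No output means none of the known patterns matched, not that the logs are clean, so a manual `grep -i error` pass is still worthwhile.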
| |
| --- |
| |
| ## Model Compatibility |
| |
| ### Supported Formats |
| - **HuggingFace (default)**: `MODEL_FORMAT=hf` |
| - **Local directory**: Mount model folder to `/models` |
| - **AWQ quantized**: Set `QUANTIZATION=awq` and use AWQ model |
|
|
| ### Gated Models (Llama 3.1, etc.) |
| 1. Request access on HuggingFace |
| 2. Get your token: https://huggingface.co/settings/tokens |
| 3. Authenticate: |
| ```bash |
| huggingface-cli login |
| # Paste token |
| ``` |
|
|
| ### Unsupported Models |
| If vLLM doesn't support your model architecture: |
| - Use `trust_remote_code=True` (already set) |
| - Convert model to supported format |
| - Check vLLM supported models: https://docs.vllm.ai/ |
|
|
| --- |
|
|
| ## Debug Mode |
|
|
| Enable verbose logging: |
| ```bash |
| export LOG_LEVEL=DEBUG |
| # restart services |
| docker-compose down && docker-compose up -d |
| ``` |
|
|
| --- |
|
|
| ## Getting Help |
|
|
| 1. Check this guide for common symptoms |
| 2. Review logs: `docker-compose logs vllm` |
| 3. Search issues: https://github.com/vllm-project/vllm/issues |
| 4. Community: https://discord.gg/vllm |
|
|
| --- |
|
|
| ## Quick Reference Commands |
|
|
| ```bash |
| # Start deployment |
| cd stack-2.9-deploy |
| ./local_deploy.sh |
| |
| # Stop deployment |
| docker-compose down |
| |
| # View logs |
| docker-compose logs -f vllm |
| |
| # Restart single service |
| docker-compose restart vllm |
| |
| # Check service status |
| docker-compose ps |
| |
| # Access container shell |
| docker-compose exec vllm bash |
| |
| # Clean everything (WARNING: deletes data!) |
| docker-compose down -v |
| rm -rf models logs |
| |
| # Rebuild image (after Dockerfile changes) |
| docker-compose build --no-cache vllm |
| docker-compose up -d |
| ``` |
|
|