Deployment Troubleshooting Guide
Quick Diagnostic
Run the health check first:
curl http://localhost:8000/health
Or use Python:
python3 -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/health').read())"
Check logs:
docker-compose logs -f vllm
# or
tail -f logs/vllm.log
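Because first-time model loading can take several minutes, it can be handy to poll the health endpoint in a loop instead of re-running curl by hand. A minimal sketch, assuming the /health endpoint shown above (the URL and timeouts are illustrative):

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url: str, timeout: float = 900.0, interval: float = 5.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not up yet; keep polling
        time.sleep(interval)
    return False


if __name__ == "__main__":
    ok = wait_for_health("http://localhost:8000/health")
    print("healthy" if ok else "timed out")
```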
Common Issues and Solutions
1. Docker/Compose Issues
Problem: docker: command not found
Error: Docker is not installed or not in PATH.
Solution:
# Install Docker (Ubuntu/Debian)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo usermod -aG docker $USER
# Log out and back in
# Install Docker Compose
sudo apt-get install docker-compose-plugin
# or download binary: https://github.com/docker/compose/releases
Problem: Cannot connect to the Docker daemon
Error: Permission denied or socket not found.
Solution:
# Start Docker service
sudo systemctl start docker
sudo systemctl enable docker
# Verify permissions
docker info
Problem: nvidia: driver not installed or GPU not detected
Error: Docker doesn't see NVIDIA GPU.
Solution:
# Install NVIDIA Container Toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
2. vLLM Service Issues
Problem: GPU Out of Memory (OOM)
Error in logs: CUDA out of memory or CUDA error: out of memory
Solution:
- Reduce model memory usage via environment variables:
export GPU_MEMORY_UTILIZATION=0.7 # Lower from 0.9
export MAX_MODEL_LEN=8192 # Reduce from 131072
export BLOCK_SIZE=16 # Smaller blocks
- Use a quantized model (recommended): convert the model to AWQ or GGUF format and set QUANTIZATION=awq in the environment
- Use a smaller model: switch from Llama-3.1-8B to a smaller variant
- Reduce batch size:
export MAX_BATCH_SIZE=4
- Ensure no other processes are using GPU:
nvidia-smi # Check for other processes
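As a rough sizing rule, weight memory alone is about parameter count times bytes per parameter, before the KV cache and activations are added on top. A quick back-of-envelope helper (illustrative, not a vLLM API):

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Approximate GPU memory for model weights only, in GiB.

    The KV cache and activations come on top of this, which is why
    lowering GPU_MEMORY_UTILIZATION or MAX_MODEL_LEN also helps.
    """
    return num_params * bytes_per_param / 2**30

# 8B parameters at fp16 (2 bytes) vs. 4-bit AWQ (~0.5 bytes)
print(f"fp16: {weight_memory_gib(8e9, 2):.1f} GiB")   # ~14.9 GiB
print(f"awq4: {weight_memory_gib(8e9, 0.5):.1f} GiB") # ~3.7 GiB
```

If the fp16 figure is already close to your card's total VRAM, no amount of tuning GPU_MEMORY_UTILIZATION will help; quantize or pick a smaller model.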
Problem: Model not found
Error: Model fails to load, FileNotFoundError, or stays in loading state.
Solution:
- Check model path:
# For local model:
ls -la models/
# Should contain config.json plus weight files (*.safetensors or pytorch_model.bin)
# For HuggingFace model:
# Set MODEL_NAME to HF name, e.g., meta-llama/Llama-3.1-8B-Instruct
- Download model manually if automatic download fails:
# Install huggingface-cli
pip install huggingface-hub
# Download (requires authentication for gated models)
huggingface-cli login # if needed
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir models
- Check disk space:
df -h
# An 8B model needs ~16GB in fp16/bf16 (~32GB in fp32, ~8GB quantized)
- Use a pre-downloaded model:
- Upload the model to the models/ directory before starting
- Mount an external volume containing the model
Problem: Health check timeout or 503 Service Unavailable
Cause: Model still loading, or failed to start.
Diagnosis:
docker-compose logs vllm
# Look for "Model loaded successfully" or error messages
Solution:
- Wait longer (first load can take 5-15 minutes)
- Check logs for specific errors (OOM, missing files)
- Increase healthcheck start_period:
healthcheck:
  start_period: 300s  # Increase from 120s
Problem: CORS or network errors when calling API
Symptoms: Connection refused, network timeout.
Solution:
# Check if container is running
docker-compose ps
# Check port mapping
docker-compose port vllm 8000
# Test from inside container
docker-compose exec vllm curl http://localhost:8000/health
# Check firewall
sudo ufw status
sudo ufw allow 8000
Problem: Redis connection failed
Error: Could not connect to Redis
Solution:
- Redis is optional (caching). vLLM will continue without it.
- If you want Redis:
docker-compose ps redis # Check if running
docker-compose logs redis
3. Docker Compose Issues
Problem: Port already in use
Error: Bind for 0.0.0.0:8000 failed: port is already allocated
Solution:
# Find process using port
lsof -i :8000
# or
netstat -tulpn | grep :8000
# Kill process or change port in docker-compose.yml:
# ports:
# - "8001:8000" # Map host 8001 to container 8000
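If you would rather not guess which host port is free, you can ask the OS for one and use it in the port mapping. A small illustrative helper:

```python
import socket


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # Port 0 = kernel picks a free port
        return s.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    print(f'Map host port {port} to the container, e.g. "{port}:8000"')
```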
Problem: Volume mount permission denied
Error: Cannot mount ./models:/models
Solution:
# Create directories with proper permissions
mkdir -p models logs
sudo chown -R $(id -u):$(id -g) models logs
# On SELinux systems, append :z to the mount (e.g. ./models:/models:z)
Problem: docker-compose: command not found
Solution:
# Docker Compose v2 (included with Docker)
sudo apt-get install docker-compose-plugin
# Or the standalone docker-compose binary (Docker Compose v2)
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
4. Cloud Deployment Issues
RunPod Specific
Problem: runpodctl: command not found
# Install
curl -L https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-linux-amd64 -o runpodctl
sudo install runpodctl /usr/local/bin/
runpodctl config # Set API key
Problem: Template not found or pod creation failed
- Ensure you have sufficient quota/balance
- Check GPU availability in your region
- Verify template name (case-sensitive)
Problem: SCP/SSH connection failed
- Pod may still be starting; wait 2-3 minutes
- Check pod status: runpodctl get pod <id>
- Verify the pod is in the RUNNING state
Problem: Insufficient disk space on pod
- Increase the disk size in the script (DISK_SIZE=100 or higher)
- Upload the model separately to /workspace/models before starting
Vast.ai Specific
Problem: vastai: command not found
pip install vastai
# or download from: https://vast.ai/docs/cli
Problem: No suitable instance found
- Relax search criteria (lower VAST_GPU_RAM)
- Increase VAST_SEARCH_LIMIT
- Check the marketplace manually: vastai search offers "cuda>=11.8"
Problem: SSH connection refused
- Instance may still be provisioning
- Check vastai show instance <id>
- Ensure port forwarding is set up correctly
Problem: Instance died or unresponsive
- Check if balance depleted
- Instance may have been evicted (low priority)
- Use the --priority flag or choose higher-cost instances
Performance Tuning
Reduce Latency
export MAX_BATCH_SIZE=4 # Smaller batches for lower latency
export MAX_MODEL_LEN=4096 # Shorter context window
export GPU_MEMORY_UTILIZATION=0.8
Increase Throughput
export MAX_BATCH_SIZE=32 # Larger batches
export MAX_MODEL_LEN=16384 # Longer context capability
export GPU_MEMORY_UTILIZATION=0.95
Multi-GPU Setup
# Automatically detected. Ensure tensor parallel size matches GPU count:
# export TENSOR_PARALLEL_SIZE=2 # For 2 GPUs (usually auto-detected)
Monitoring
Health Endpoint
curl http://localhost:8000/health | jq
# Returns: {"status":"healthy","model":{...},"timestamp":...}
Readiness Endpoint (K8s liveness)
curl http://localhost:8000/ready
# Returns: {"status":"ready"}
Prometheus Metrics
curl http://localhost:9090/metrics
# Look for: vllm_requests_total, vllm_request_latency_seconds
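To pull a single counter out of the Prometheus text format without extra tooling, a few lines of parsing suffice. A sketch; the metric name used below is the one mentioned in the comment above and may differ between vLLM versions:

```python
def read_counter(metrics_text: str, name: str) -> float:
    """Sum all samples of `name` in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.startswith(name):
            continue  # Skip comments/TYPE lines and other metrics
        # Each sample line is "name{labels} value" or "name value"
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total


sample = """# TYPE vllm_requests_total counter
vllm_requests_total{status="ok"} 40
vllm_requests_total{status="error"} 2
"""
print(read_counter(sample, "vllm_requests_total"))  # 42.0
```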
Container Logs
# All logs
docker-compose logs -f vllm
# Last 100 lines
docker-compose logs --tail=100 vllm
# Search for errors
docker-compose logs vllm | grep -i error
Model Compatibility
Supported Formats
- HuggingFace (default): MODEL_FORMAT=hf
- Local directory: mount the model folder to /models
- AWQ quantized: set QUANTIZATION=awq and use an AWQ model
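Putting the format options together, a typical environment for a locally mounted AWQ model might look like this (variable names as used elsewhere in this guide; the values are illustrative):

```shell
# AWQ-quantized model served from a mounted local directory
export MODEL_NAME=/models            # mounted model folder
export MODEL_FORMAT=hf
export QUANTIZATION=awq
export GPU_MEMORY_UTILIZATION=0.8
```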
Gated Models (Llama 3.1, etc.)
- Request access on HuggingFace
- Get your token: https://huggingface.co/settings/tokens
- Authenticate:
huggingface-cli login
# Paste token
Unsupported Models
If vLLM doesn't support your model architecture:
- Use trust_remote_code=True (already set)
- Convert the model to a supported format
- Check vLLM supported models: https://docs.vllm.ai/
Debug Mode
Enable verbose logging:
export LOG_LEVEL=DEBUG
# restart services
docker-compose down && docker-compose up -d
Getting Help
- Check this guide for common symptoms
- Review logs: docker-compose logs vllm
- Search issues: https://github.com/vllm-project/vllm/issues
- Community: https://discord.gg/vllm
Quick Reference Commands
# Start deployment
cd stack-2.9-deploy
./local_deploy.sh
# Stop deployment
docker-compose down
# View logs
docker-compose logs -f vllm
# Restart single service
docker-compose restart vllm
# Check service status
docker-compose ps
# Access container shell
docker-compose exec vllm bash
# Clean everything (WARNING: deletes data!)
docker-compose down -v
rm -rf models logs
# Rebuild image (after Dockerfile changes)
docker-compose build --no-cache vllm
docker-compose up -d