## GCE VM Runbook — CommitGuard GRPO Training
### Step 1: Create VM
Run from your local machine (or use GCP Console):
```bash
# Option A: L4 (24 GB VRAM, ~$0.70/hr) — RECOMMENDED
gcloud compute instances create commitguard-train \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"

# Option B: A100 (40 GB VRAM, ~$2.50/hr) — if L4 unavailable
gcloud compute instances create commitguard-train \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"

# Option C: T4 (16 GB VRAM, ~$0.35/hr) — budget fallback
gcloud compute instances create commitguard-train \
  --zone=us-central1-b \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"
```
### Step 2: SSH into VM
```bash
gcloud compute ssh commitguard-train --zone=us-central1-a
```
### Step 3: One-command setup
```bash
curl -sSL https://raw.githubusercontent.com/NitishKumar-ai/commitguard/main/scripts/gcp_setup.sh | bash
```
Or manually:
```bash
git clone https://github.com/NitishKumar-ai/commitguard.git
cd commitguard
bash scripts/gcp_setup.sh
```
### Step 4: Start env server (in tmux)
```bash
tmux new -s server
# inside the new tmux session (the venv must be activated here, not before):
cd ~/commitguard && source .venv/bin/activate
server
# Ctrl-B D to detach
```
Verify:
```bash
curl -s http://localhost:8000/health
# → {"status":"healthy"}
```
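If the server takes a while to come up, a small poll loop beats re-running curl by hand. A minimal sketch in Python, based on the endpoint and response shape shown in the curl check above (the helper names `is_healthy` and `wait_for_server` are ours, not part of the repo):

```python
import json
import time
import urllib.request

def is_healthy(body: str) -> bool:
    """Parse a /health response body and check the status field."""
    try:
        return json.loads(body).get("status") == "healthy"
    except json.JSONDecodeError:
        return False

def wait_for_server(url: str = "http://localhost:8000/health",
                    timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll the health endpoint until it reports healthy or we time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_healthy(resp.read().decode()):
                    return True
        except OSError:
            pass  # server not up yet; retry after a short sleep
        time.sleep(interval)
    return False
```

Run `wait_for_server()` before kicking off training so Step 6 never races the env server.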
### Step 5: Login to HuggingFace + Wandb
```bash
source ~/commitguard/.venv/bin/activate
huggingface-cli login # paste your HF token (needed for the gated Llama model)
wandb login # paste your wandb API key
```
### Step 6: Start training
```bash
cd ~/commitguard && source .venv/bin/activate
export WANDB_PROJECT=commitguard
# Quick smoke test first (~5 min)
python scripts/train_grpo.py \
  --samples 20 \
  --max-steps 10 \
  --no-wandb

# Full run (~2-3 hours on L4)
python scripts/train_grpo.py \
  --samples 200 \
  --max-steps 300 \
  --save-steps 50 \
  --num-generations 4 \
  --batch-size 1 \
  --grad-accum 4
```
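When sizing a run, it helps to see how the flags multiply out: each prompt is sampled `--num-generations` times, and `--batch-size` × `--grad-accum` prompts feed one optimizer step. This arithmetic is our reading of the flags, not a guarantee about `train_grpo.py` internals:

```python
def completions_per_step(batch_size: int, grad_accum: int,
                         num_generations: int) -> int:
    """Completions per optimizer step: prompts per step x samples per prompt."""
    return batch_size * grad_accum * num_generations

# Full run above: batch size 1, grad accum 4, 4 generations per prompt
print(completions_per_step(1, 4, 4))  # 16 completions per optimizer step
```

Keeping this product constant (e.g. batch size 2 with grad accum 2) trades VRAM for wall-clock time without changing the effective step size.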
### Step 7: Monitor
```bash
# In another tmux pane:
watch -n 30 nvidia-smi # GPU memory
# Wandb dashboard: https://wandb.ai/<your-user>/commitguard
```
### Step 8: Copy results back
```bash
# From your LOCAL machine:
gcloud compute scp --recurse \
  commitguard-train:~/commitguard/outputs/commitguard-llama-3b/final \
  ./outputs/commitguard-llama-3b/final \
  --zone=us-central1-a
```
### Step 9: Shut down VM
```bash
gcloud compute instances stop commitguard-train --zone=us-central1-a
# or delete to stop billing entirely:
gcloud compute instances delete commitguard-train --zone=us-central1-a
```
### Cost estimate
| GPU  | VRAM  | $/hr  | 300 steps (~3 hr) |
|------|-------|-------|-------------------|
| T4   | 16 GB | $0.35 | ~$1.05 |
| L4   | 24 GB | $0.70 | ~$2.10 |
| A100 | 40 GB | $2.50 | ~$7.50 |
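The last column is just hourly rate times hours. A quick sketch to re-derive it for other run lengths (assumes billing scales linearly with wall-clock time; it ignores disk and egress, which are small here):

```python
def run_cost(rate_per_hr: float, hours: float) -> float:
    """Estimated on-demand GPU cost: hourly rate times wall-clock hours."""
    return round(rate_per_hr * hours, 2)

# Re-derive the table's last column for a 3-hour, 300-step run
for gpu, rate in [("T4", 0.35), ("L4", 0.70), ("A100", 2.50)]:
    print(gpu, run_cost(rate, 3.0))
```

Note the 3-hour figure is calibrated to the L4; a T4 will take longer per step, so scale `hours` up accordingly.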
### Troubleshooting
- **OOM on T4**: lower `--num-generations` to 2 and keep `--batch-size 1`
- **Llama access denied**: make sure you accepted the license at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- **Env server not responding**: check `tmux attach -t server` for errors
- **Wandb not logging**: verify `wandb login` succeeded, or use `--no-wandb`
- **GPU quota error**: request GPU quota increase at https://console.cloud.google.com/iam-admin/quotas