
# GCE VM Runbook — CommitGuard GRPO Training

## Step 1: Create VM

Run from your local machine (or use GCP Console):

```bash
# Option A: L4 (24 GB VRAM, ~$0.70/hr) — RECOMMENDED
gcloud compute instances create commitguard-train \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"

# Option B: A100 (40 GB VRAM, ~$2.50/hr) — if L4 unavailable
gcloud compute instances create commitguard-train \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"

# Option C: T4 (16 GB VRAM, ~$0.35/hr) — budget fallback
gcloud compute instances create commitguard-train \
  --zone=us-central1-b \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"
```

## Step 2: SSH into VM

```bash
gcloud compute ssh commitguard-train --zone=us-central1-a
```
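On a fresh Deep Learning VM image, the NVIDIA driver install triggered by the `install-nvidia-driver` metadata key can still be in progress when you first log in. Before starting anything long-running, it's worth confirming the GPU is actually visible. A minimal check (the `check_gpu` helper name is ours for illustration, not part of the repo):

```bash
# check_gpu: confirm the NVIDIA driver is up before launching training.
# (Helper name is illustrative; not part of the commitguard repo.)
check_gpu() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found -- driver still installing or missing" >&2
    return 1
  fi
  # Prints GPU name and total memory, one line per GPU
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
}

check_gpu
```

If the driver is still installing, wait a minute or two and retry before moving on.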

## Step 3: One-command setup

```bash
curl -sSL https://raw.githubusercontent.com/NitishKumar-ai/commitguard/main/scripts/gcp_setup.sh | bash
```

Or manually:

```bash
git clone https://github.com/NitishKumar-ai/commitguard.git
cd commitguard
bash scripts/gcp_setup.sh
```

## Step 4: Start env server (in tmux)

```bash
cd ~/commitguard && source .venv/bin/activate
tmux new -s server
server
# Ctrl-B D to detach
```

Verify:

```bash
curl -s http://localhost:8000/health
# → {"status":"healthy"}
```
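If you script these steps, a bare `curl` fired immediately after launch can race the server's startup. A small retry loop avoids that; this is our own sketch (the helper name is illustrative), using only the endpoint and payload shown above:

```bash
# wait_for_health: poll the env server's /health endpoint until it reports
# healthy, or give up after N attempts. (Helper is illustrative, not in the repo.)
wait_for_health() {
  local url="${1:-http://localhost:8000/health}"
  local tries="${2:-30}"
  local i
  for ((i = 1; i <= tries; i++)); do
    if curl -sf "$url" | grep -q '"status":"healthy"'; then
      echo "server healthy after ${i} attempt(s)"
      return 0
    fi
    sleep 2
  done
  echo "server not healthy after ${tries} attempts" >&2
  return 1
}
```

Usage: `wait_for_health && python scripts/train_grpo.py …` gates training on a live server.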

## Step 5: Log in to Hugging Face + W&B

```bash
source ~/commitguard/.venv/bin/activate
huggingface-cli login          # paste your HF token (needed for the gated Llama model)
wandb login                    # paste your wandb API key
```

## Step 6: Start training

```bash
cd ~/commitguard && source .venv/bin/activate
export WANDB_PROJECT=commitguard

# Quick smoke test first (~5 min)
python scripts/train_grpo.py \
  --samples 20 \
  --max-steps 10 \
  --no-wandb

# Full run (~2-3 hours on L4)
python scripts/train_grpo.py \
  --samples 200 \
  --max-steps 300 \
  --save-steps 50 \
  --num-generations 4 \
  --batch-size 1 \
  --grad-accum 4
```
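For context on the full-run flags: the optimizer sees `--batch-size × --grad-accum` samples per step, and GRPO implementations typically require that effective batch to divide evenly by `--num-generations`, since each prompt is sampled that many times to form a comparison group. A quick check of the recommended settings:

```bash
# Effective optimizer batch for the recommended flags (assumes a single GPU)
BATCH_SIZE=1
GRAD_ACCUM=4
NUM_GENERATIONS=4

EFFECTIVE=$((BATCH_SIZE * GRAD_ACCUM))
echo "effective batch: ${EFFECTIVE}"                     # → effective batch: 4
# Group sampling needs the effective batch divisible by num-generations
echo "divisible: $((EFFECTIVE % NUM_GENERATIONS == 0))"  # → divisible: 1
```

If you raise `--num-generations`, raise `--grad-accum` (or `--batch-size`) to keep the division even.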

## Step 7: Monitor

```bash
# In another tmux pane:
watch -n 30 nvidia-smi          # GPU memory
# Wandb dashboard: https://wandb.ai/<your-user>/commitguard
```

## Step 8: Copy results back

```bash
# From your LOCAL machine:
gcloud compute scp --recurse \
  commitguard-train:~/commitguard/outputs/commitguard-llama-3b/final \
  ./outputs/commitguard-llama-3b/final \
  --zone=us-central1-a
```

## Step 9: Shut down VM

```bash
gcloud compute instances stop commitguard-train --zone=us-central1-a
# or delete to stop billing entirely:
gcloud compute instances delete commitguard-train --zone=us-central1-a
```

## Cost estimate

| GPU  | VRAM  | $/hr  | 300 steps (~3 hr) |
|------|-------|-------|-------------------|
| T4   | 16 GB | $0.35 | ~$1.05            |
| L4   | 24 GB | $0.70 | ~$2.10            |
| A100 | 40 GB | $2.50 | ~$7.50            |
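The per-run column is just hourly rate × wall-clock time, with a 300-step run estimated at roughly 3 hours. To re-estimate for a different GPU or step count (the `cost` helper is ours, for illustration):

```bash
# cost: estimated spend = hourly rate * hours, rounded to cents
cost() { awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'; }

cost 0.35 3   # T4, full run   → 1.05
cost 0.70 3   # L4, full run   → 2.10
cost 2.50 3   # A100, full run → 7.50
```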

## Troubleshooting