## GCE VM Runbook — CommitGuard GRPO Training
### Step 1: Create VM
Run from your local machine (or use GCP Console):
```bash
# Option A: L4 (24 GB VRAM, ~$0.70/hr) — RECOMMENDED
gcloud compute instances create commitguard-train \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"

# Option B: A100 (40 GB VRAM, ~$2.50/hr) — if L4 unavailable
gcloud compute instances create commitguard-train \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"

# Option C: T4 (16 GB VRAM, ~$0.35/hr) — budget fallback
gcloud compute instances create commitguard-train \
  --zone=us-central1-b \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"
```
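Before moving on, you can confirm the instance actually came up. A minimal sketch (`vm_status` is a hypothetical wrapper, not part of the repo; it assumes the Option A zone):

```shell
# vm_status: print the instance state via gcloud (hypothetical helper).
# Defaults match Option A; pass a name/zone to override.
vm_status() {
  gcloud compute instances describe "${1:-commitguard-train}" \
    --zone="${2:-us-central1-a}" \
    --format='value(status)'
}
# vm_status            # prints RUNNING once the VM is up
```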
### Step 2: SSH into VM
```bash
gcloud compute ssh commitguard-train --zone=us-central1-a
# (use --zone=us-central1-b if you created the T4 VM from Option C)
```
### Step 3: One-command setup
```bash
curl -sSL https://raw.githubusercontent.com/NitishKumar-ai/commitguard/main/scripts/gcp_setup.sh | bash
```
Or manually:
```bash
git clone https://github.com/NitishKumar-ai/commitguard.git
cd commitguard
bash scripts/gcp_setup.sh
```
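Either way, it's worth a quick sanity check that setup finished. A sketch (`check_setup` is a hypothetical helper; it assumes the `.venv` path used elsewhere in this runbook):

```shell
# check_setup: verify the virtualenv the setup script should have created.
check_setup() {
  local dir="${1:-$HOME/commitguard}"
  if [ -x "$dir/.venv/bin/python" ]; then
    echo "venv ready"
  else
    echo "setup incomplete: no venv at $dir/.venv" >&2
    return 1
  fi
}
# check_setup ~/commitguard
```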
### Step 4: Start env server (in tmux)
```bash
cd ~/commitguard && source .venv/bin/activate
tmux new -s server
server
# Ctrl-B D to detach
```
Verify:
```bash
curl -s http://localhost:8000/health
# → {"status":"healthy"}
```
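If that curl fails right after startup, the server may simply still be initializing. A small polling sketch (`wait_for_health` is not part of the repo, just a convenience):

```shell
# wait_for_health: poll a health endpoint until it reports healthy,
# retrying every 2 s up to a limit.
wait_for_health() {
  local url="$1" tries="${2:-30}" i
  for i in $(seq 1 "$tries"); do
    if curl -s "$url" | grep -q '"status":"healthy"'; then
      echo "server ready"
      return 0
    fi
    sleep 2
  done
  echo "server did not become healthy" >&2
  return 1
}
# wait_for_health http://localhost:8000/health
```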
### Step 5: Login to HuggingFace + Wandb
```bash
source ~/commitguard/.venv/bin/activate
huggingface-cli login # paste your HF token (needed for the gated Llama model)
wandb login # paste your wandb API key
```
### Step 6: Start training
```bash
cd ~/commitguard && source .venv/bin/activate
export WANDB_PROJECT=commitguard
# Full run (~2-3 hours on L4)
python scripts/train_grpo.py \
--samples 200 \
--max-steps 300 \
--save-steps 50 \
--num-generations 4 \
--batch-size 1 \
--grad-accum 4
# Quick smoke test first (5 min)
python scripts/train_grpo.py \
--samples 20 \
--max-steps 10 \
--no-wandb
```
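For budgeting GPU memory, it helps to know roughly how many completions are in flight per optimizer step. A back-of-envelope helper (`completions_per_step` is hypothetical; exact accounting depends on the trainer's GRPO implementation):

```shell
# completions_per_step: rough count of completions scored per optimizer
# step, given batch-size, grad-accum, and num-generations.
completions_per_step() {
  echo $(( $1 * $2 * $3 ))
}
# completions_per_step 1 4 4   # the full-run flags above → 16
```

This is also why the OOM advice below targets `--num-generations`: halving it halves the completions held in memory at once.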
### Step 7: Monitor
```bash
# In another tmux pane:
watch -n 30 nvidia-smi # GPU memory
# Wandb dashboard: https://wandb.ai/<your-user>/commitguard
```
### Step 8: Copy results back
```bash
# From your LOCAL machine:
gcloud compute scp --recurse \
commitguard-train:~/commitguard/outputs/commitguard-llama-3b/final \
./outputs/commitguard-llama-3b/final \
--zone=us-central1-a
```
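Before shutting the VM down, confirm the copy actually succeeded. A minimal sketch (`check_checkpoint` is a hypothetical helper; the file names assume a PEFT/LoRA-style save, so adjust them if `train_grpo.py` writes full weights):

```shell
# check_checkpoint: verify the copied checkpoint directory contains the
# files a LoRA/PEFT save typically produces (assumed names).
check_checkpoint() {
  local dir="$1" f
  for f in adapter_config.json adapter_model.safetensors; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f" >&2
      return 1
    fi
  done
  echo "checkpoint looks complete"
}
# check_checkpoint ./outputs/commitguard-llama-3b/final
```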
### Step 9: Shut down VM
```bash
gcloud compute instances stop commitguard-train --zone=us-central1-a
# or delete to stop billing entirely:
gcloud compute instances delete commitguard-train --zone=us-central1-a
```
### Cost estimate
| GPU  | VRAM  | $/hr  | 300 steps (~3 hr) |
|------|-------|-------|-------------------|
| T4   | 16 GB | $0.35 | ~$1.05 |
| L4   | 24 GB | $0.70 | ~$2.10 |
| A100 | 40 GB | $2.50 | ~$7.50 |
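The last column is just rate times hours; a tiny helper (`estimate_cost` is hypothetical, and the rates are the approximate on-demand prices above, which vary by region) lets you re-run the math with your own numbers:

```shell
# estimate_cost: print estimated USD cost for a run, given $/hr and hours.
estimate_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}
# estimate_cost 0.70 3   # L4 at ~3 hours → 2.10
```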
### Troubleshooting
- **OOM on T4**: lower `--num-generations` to 2 and keep `--batch-size 1`
- **Llama access denied**: make sure you accepted the license at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- **Env server not responding**: check `tmux attach -t server` for errors
- **Wandb not logging**: verify `wandb login` succeeded, or use `--no-wandb`
- **GPU quota error**: request GPU quota increase at https://console.cloud.google.com/iam-admin/quotas