## GCE VM Runbook - CommitGuard GRPO Training

### Step 1: Create VM

Run from your local machine (or use the GCP Console):

```bash
# Option A: L4 (24 GB VRAM, ~$0.70/hr) - RECOMMENDED
gcloud compute instances create commitguard-train \
  --zone=us-central1-a \
  --machine-type=g2-standard-8 \
  --accelerator=type=nvidia-l4,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"

# Option B: A100 (40 GB VRAM, ~$2.50/hr) - if L4 unavailable
gcloud compute instances create commitguard-train \
  --zone=us-central1-a \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"

# Option C: T4 (16 GB VRAM, ~$0.35/hr) - budget fallback
gcloud compute instances create commitguard-train \
  --zone=us-central1-b \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --boot-disk-size=100GB \
  --image-family=pytorch-latest-gpu \
  --image-project=deeplearning-platform-release \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"
```
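GPU capacity varies by zone, and `instances create` fails if the chosen zone has none of the requested type. Before picking `--zone` above, you can list which accelerator types a zone actually exposes (the zone in the filter is just an example):

```bash
# List GPU types offered in a candidate zone before creating the VM.
gcloud compute accelerator-types list --filter="zone:us-central1-a"
```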
### Step 2: SSH into VM

```bash
gcloud compute ssh commitguard-train --zone=us-central1-a
```

(If you created the Option C VM, use `--zone=us-central1-b` here and in the later commands.)

### Step 3: One-command setup

```bash
curl -sSL https://raw.githubusercontent.com/NitishKumar-ai/commitguard/main/scripts/gcp_setup.sh | bash
```

Or manually:

```bash
git clone https://github.com/NitishKumar-ai/commitguard.git
cd commitguard
bash scripts/gcp_setup.sh
```

### Step 4: Start env server (in tmux)

```bash
cd ~/commitguard && source .venv/bin/activate
tmux new -s server
server
# Ctrl-B D to detach
```

Verify:

```bash
curl -s http://localhost:8000/health
# → {"status":"healthy"}
```
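The server can take a moment to come up, so a one-shot `curl` may fail right after launch. A small retry helper (hypothetical; `gcp_setup.sh` does not install this) that polls a command until it succeeds:

```bash
# wait_for: run a command up to N times, sleeping 1s between attempts.
wait_for() {
  local tries=$1; shift
  local i
  for ((i = 1; i <= tries; i++)); do
    "$@" && return 0   # command succeeded
    sleep 1            # brief back-off before the next attempt
  done
  return 1             # gave up after $tries attempts
}

# Usage on the VM (uncomment): poll the health endpoint for up to 30 s.
# wait_for 30 curl -sf http://localhost:8000/health
```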
### Step 5: Log in to Hugging Face + Wandb

```bash
source ~/commitguard/.venv/bin/activate
huggingface-cli login   # paste your HF token (needed for the gated Llama model)
wandb login             # paste your wandb API key
```

### Step 6: Start training

```bash
cd ~/commitguard && source .venv/bin/activate
export WANDB_PROJECT=commitguard

# Quick smoke test first (~5 min)
python scripts/train_grpo.py \
  --samples 20 \
  --max-steps 10 \
  --no-wandb

# Full run (~2-3 hours on L4)
python scripts/train_grpo.py \
  --samples 200 \
  --max-steps 300 \
  --save-steps 50 \
  --num-generations 4 \
  --batch-size 1 \
  --grad-accum 4
```
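For a sense of scale, the full-run flags imply the per-step counts below (assuming the usual GRPO accounting, where each prompt is sampled `--num-generations` times; verify against `train_grpo.py` if in doubt):

```bash
# Back-of-the-envelope throughput for the full-run flags above.
BATCH_SIZE=1; GRAD_ACCUM=4; NUM_GENERATIONS=4; MAX_STEPS=300

PROMPTS_PER_STEP=$((BATCH_SIZE * GRAD_ACCUM))                 # prompts per optimizer step
COMPLETIONS_PER_STEP=$((PROMPTS_PER_STEP * NUM_GENERATIONS))  # completions scored per step
TOTAL_COMPLETIONS=$((COMPLETIONS_PER_STEP * MAX_STEPS))       # over the whole run

echo "$PROMPTS_PER_STEP $COMPLETIONS_PER_STEP $TOTAL_COMPLETIONS"
# → 4 16 4800
```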
### Step 7: Monitor

```bash
# In another tmux pane:
watch -n 30 nvidia-smi   # GPU memory
# Wandb dashboard: https://wandb.ai/<your-user>/commitguard
```

### Step 8: Copy results back

```bash
# From your LOCAL machine:
gcloud compute scp --recurse \
  commitguard-train:~/commitguard/outputs/commitguard-llama-3b/final \
  ./outputs/commitguard-llama-3b/final \
  --zone=us-central1-a
```

### Step 9: Shut down VM

```bash
# stop pauses compute billing (the boot disk still bills):
gcloud compute instances stop commitguard-train --zone=us-central1-a
# or delete to stop billing entirely:
gcloud compute instances delete commitguard-train --zone=us-central1-a
```
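An optional safeguard in case you forget Step 9: schedule a halt on the VM when you kick off training (`shutdown` is standard on these Debian-based Deep Learning images, and a halted instance stops compute billing):

```bash
# On the VM: schedule an automatic halt in 180 minutes (~3 h run plus slack).
sudo shutdown -h +180
# Changed your mind? Cancel the pending shutdown:
# sudo shutdown -c
```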
### Cost estimate

| GPU  | VRAM  | $/hr  | 300 steps (~3 hr) |
|------|-------|-------|-------------------|
| T4   | 16 GB | $0.35 | ~$1.05            |
| L4   | 24 GB | $0.70 | ~$2.10            |
| A100 | 40 GB | $2.50 | ~$7.50            |
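The last column is just hourly rate times wall-clock hours; a tiny helper to re-run the arithmetic for your own rate or run length:

```bash
# cost RATE HOURS -> dollar estimate, e.g. the table rows above.
cost() { awk -v r="$1" -v h="$2" 'BEGIN { printf "%.2f\n", r * h }'; }

cost 0.35 3   # T4   → 1.05
cost 0.70 3   # L4   → 2.10
cost 2.50 3   # A100 → 7.50
```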
### Troubleshooting

- **OOM on T4**: lower the sample count per prompt with `--num-generations 2` and keep `--batch-size 1`
- **Llama access denied**: make sure you accepted the license at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- **Env server not responding**: check `tmux attach -t server` for errors
- **Wandb not logging**: verify `wandb login` succeeded, or use `--no-wandb`
- **GPU quota error**: request a GPU quota increase at https://console.cloud.google.com/iam-admin/quotas