# HF Training Checklist – CommitGuard
**Print this. Tick every box in order. Do NOT skip steps.**
**If any box fails: STOP. Fix before proceeding.**
---
## PHASE 0 – Account Setup (Do Once, Do NOW)
- [ ] `huggingface-cli login` → authenticated
- [ ] `huggingface-cli whoami` → shows your username
- [ ] HF credits visible at https://huggingface.co/settings/billing → $30 showing
- [ ] Claim HF credits if not done: https://huggingface.co/coupons/claim/hf-openenv-community
- [ ] Llama-3.2-3B license accepted at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- [ ] License status: "You have been granted access" (NOT "pending")
- [ ] If pending after 30 min → **SWITCH TO Qwen2.5-1.5B-Instruct. No waiting.**
- [ ] `wandb login` → authenticated
- [ ] Wandb project created: `commitguard`
---
## PHASE 1 – Environment Health (Before ANY Training)
### 1A. HF Space is alive
```bash
curl https://<username>-commitguard.hf.space/health
```
- [ ] Returns `{"status": "healthy"}` with HTTP 200
- [ ] Response time < 3 seconds
### 1B. Env accepts actions
```bash
# Reset
curl -X POST https://<username>-commitguard.hf.space/reset
```
- [ ] Returns JSON with `diff` field (non-empty string)
- [ ] Returns JSON with `done: false`
- [ ] Returns JSON with `reward: 0.0`
```bash
# Step with verdict
curl -X POST https://<username>-commitguard.hf.space/step \
-H "Content-Type: application/json" \
-d '{"action_type":"verdict","is_vulnerable":true,"vuln_type":"CWE-89","exploit_sketch":"sql injection"}'
```
- [ ] Returns JSON with `reward` field (NOT 0.0 – should be +1.0 or -1.0)
- [ ] Returns JSON with `done: true`
### 1C. Env handles load
- [ ] Run 10 sequential reset→step cycles → zero crashes
- [ ] Run 5 concurrent reset→step cycles → zero crashes, no race conditions
- [ ] No request takes longer than 10 seconds
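The 1C cycles can be scripted instead of run by hand. A minimal sketch, assuming the Space URL placeholder and the `/reset` / `/step` routes and verdict payload from section 1B (`requests` is an extra dependency, imported lazily so the helpers stay importable without it):

```python
# Minimal 1C load-test sketch. BASE is a placeholder -- substitute your Space URL.
import time
from concurrent.futures import ThreadPoolExecutor

BASE = "https://<username>-commitguard.hf.space"  # placeholder, not a real URL
VERDICT = {"action_type": "verdict", "is_vulnerable": True,
           "vuln_type": "CWE-89", "exploit_sketch": "sql injection"}

def one_cycle() -> float:
    """One reset→step cycle; returns wall-clock seconds, raises on HTTP error."""
    import requests  # extra dependency, only needed when actually hitting the env
    t0 = time.time()
    requests.post(f"{BASE}/reset", timeout=10).raise_for_status()
    requests.post(f"{BASE}/step", json=VERDICT, timeout=10).raise_for_status()
    return time.time() - t0

def within_budget(times: list[float], limit: float = 10.0) -> bool:
    """True if every cycle finished within the per-request budget."""
    return bool(times) and max(times) < limit

def run_load_check() -> None:
    seq = [one_cycle() for _ in range(10)]            # 10 sequential cycles
    with ThreadPoolExecutor(max_workers=5) as pool:   # 5 concurrent cycles
        conc = list(pool.map(lambda _: one_cycle(), range(5)))
    assert within_budget(seq + conc), "a request exceeded the 10 s budget"
```

Call `run_load_check()` once the Space is awake; any crash, HTTP error, or slow request fails loudly instead of silently passing the box.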
### 1D. Reward sanity
- [ ] Correct vulnerable verdict → reward > 0 (expected: +1.0)
- [ ] False positive (safe code flagged) → reward < 0 (expected: -1.0)
- [ ] False negative (vuln missed) → reward < 0 (expected: -0.5)
- [ ] Rewards are NOT all identical across different samples
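The 1D sign checks are easy to misread by eye at 3 AM; a small sketch that encodes them, using the expected values from this checklist (feed it the three rewards you actually got back from `/step`):

```python
# Encodes the 1D expectations (+1.0 / -1.0 / -0.5 per this checklist).
def reward_sanity(true_positive: float, false_positive: float,
                  false_negative: float) -> None:
    """Raises AssertionError if any 1D expectation is violated."""
    assert true_positive > 0, "correct vulnerable verdict must earn reward > 0"
    assert false_positive < 0, "flagging safe code must earn reward < 0"
    assert false_negative < 0, "missing a vuln must earn reward < 0"
    assert len({true_positive, false_positive, false_negative}) > 1, \
        "all rewards identical -- likely a reward-function bug"

reward_sanity(1.0, -1.0, -0.5)  # the checklist's expected values pass
```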
---
## PHASE 2 – Data Verification
- [ ] `data/devign_train.jsonl` exists
- [ ] `wc -l data/devign_train.jsonl` → >1000 samples
- [ ] `data/devign_test.jsonl` exists
- [ ] `wc -l data/devign_test.jsonl` → exactly 100 samples
- [ ] Train and test commit_ids are disjoint (no overlap)
- [ ] Spot check 3 samples: `code_after` is non-empty, `is_vulnerable` is boolean
- [ ] No sample exceeds 80 lines of code
- [ ] Approximate 50/50 split between vulnerable and safe samples
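The Phase 2 boxes above can be checked in one pass. A sketch, assuming each JSONL line carries at least the `commit_id`, `code_after`, and `is_vulnerable` fields named in the spot-check items (adjust field names to whatever your pipeline actually writes):

```python
# Phase 2 verification sketch: split sizes, leakage, and spot checks in one go.
import json

def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def check_split(train: list[dict], test: list[dict]) -> None:
    """Raises AssertionError on any Phase 2 violation."""
    assert len(train) > 1000, "train set too small"
    assert len(test) == 100, "test set must be exactly 100 samples"
    overlap = {s["commit_id"] for s in train} & {s["commit_id"] for s in test}
    assert not overlap, f"train/test leakage: {sorted(overlap)[:5]}"
    for s in train[:3]:  # spot check 3 samples
        assert isinstance(s["is_vulnerable"], bool), "is_vulnerable not boolean"
        assert s["code_after"].strip(), "empty code_after"
        assert len(s["code_after"].splitlines()) <= 80, "sample exceeds 80 lines"

# Usage: check_split(load_jsonl("data/devign_train.jsonl"),
#                    load_jsonl("data/devign_test.jsonl"))
```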
---
## PHASE 3 – GPU & Dependencies
### 3A. Hardware
```bash
nvidia-smi
```
- [ ] GPU visible with ≥16GB VRAM
- [ ] GPU name matches expected (T4 / A10G / L4)
- [ ] Free VRAM ≥ 14GB (kill other processes if needed)
### 3B. Python environment
```bash
python --version
```
- [ ] Python 3.10 or 3.11 (NOT 3.12 – Unsloth compatibility issues)
### 3C. Critical libraries
```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "from unsloth import FastLanguageModel; print('OK')"
python -c "from trl import GRPOTrainer; print('OK')"
python -c "from peft import PeftModel; print('OK')"
python -c "import wandb; print('OK')"
```
- [ ] torch ≥ 2.3.0, CUDA = True
- [ ] unsloth imports without error
- [ ] trl ≥ 0.12.0 imports without error
- [ ] peft imports without error
- [ ] wandb imports without error
---
## PHASE 4 – Model Loading Test
```python
from unsloth import FastLanguageModel
import torch  # needed for the memory readout below
model, tokenizer = FastLanguageModel.from_pretrained(
"meta-llama/Llama-3.2-3B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
print("Model loaded successfully")
print(f"GPU memory: {torch.cuda.memory_allocated()/1e9:.1f}GB")
```
- [ ] Model loads without OOM
- [ ] GPU memory after load < 6GB (leaves room for GRPO overhead)
- [ ] No warnings about missing tokenizer files
### LoRA application
```python
model = FastLanguageModel.get_peft_model(
model, r=8, lora_alpha=16,
target_modules=["q_proj","k_proj","v_proj","o_proj"],
)
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
```
- [ ] LoRA applies without error
- [ ] Trainable params ~3-8M (NOT the full 3B)
---
## PHASE 5 – Dry Run (2 Steps)
**THE MOST CRITICAL CHECK. DO NOT SKIP.**
```bash
python train_grpo.py --max_steps 2
```
### 5A. Generation
- [ ] First prompt formatted correctly (print it – does it contain a code diff?)
- [ ] 4 completions generated for first prompt
- [ ] At least 2 of 4 completions contain `<action_type>` XML tags
- [ ] Completions are different from each other (not all identical)
### 5B. Reward collection
- [ ] All 4 completions submitted to env
- [ ] All 4 rewards received (no timeouts)
- [ ] Rewards have variance (NOT all the same value)
- [ ] Rewards in expected range [-1.0, +2.0]
- [ ] Print rewards: `[_____, _____, _____, _____]` (write them down)
### 5C. Training step
- [ ] GRPO loss computed (finite number, not NaN, not inf, not 0.0)
- [ ] Loss value: _____ (write it down)
- [ ] Wandb shows run with 2 logged steps
- [ ] No OOM during backward pass
- [ ] Peak GPU memory: _____GB (must be < 22GB on A10G or < 14GB on T4)
### 5D. Checkpointing
- [ ] Output directory created: `./commitguard-llama-3b-grpo/`
- [ ] Checkpoint files present (or will be at step 50)
### 5E. Timing estimate
- [ ] 2 steps took _____ seconds
- [ ] Estimated time for 300 steps: _____ minutes (= 2-step-time × 150)
- [ ] Estimated cost: _____ dollars (hours × GPU hourly rate)
- [ ] Cost within budget? (must be under $8)
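The 5E arithmetic as a two-line helper. The inputs in the example are placeholders (a hypothetical 40-second 2-step time and a hypothetical $1.50/hour rate); plug in your measured time and your GPU's actual price:

```python
# 5E cost estimate: 300 steps = 150x the measured 2-step time.
def estimate(two_step_seconds: float, total_steps: int = 300,
             hourly_rate: float = 1.50) -> tuple[float, float]:
    """Returns (estimated_minutes, estimated_dollars) for the full run."""
    minutes = two_step_seconds * (total_steps / 2) / 60
    return minutes, (minutes / 60) * hourly_rate

minutes, dollars = estimate(40.0)  # hypothetical: 2 steps took 40 s
print(f"{minutes:.0f} min, ${dollars:.2f}")  # 100 min, $2.50
```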
---
## PHASE 6 – Baseline Eval (Before Training)
**MUST run baseline BEFORE training. Cannot run after – you need the contrast.**
```bash
python evaluate.py \
--model_path meta-llama/Llama-3.2-3B-Instruct \
--test_file data/devign_test.jsonl \
--output eval_baseline.json
```
- [ ] Eval completes on all 100 test samples
- [ ] Binary accuracy: _____% (write it down, expected: 30-50%)
- [ ] CWE accuracy: _____% (expected: low, maybe 5-15%)
- [ ] False positive rate: _____%
- [ ] False negative rate: _____%
- [ ] Results saved to `eval_baseline.json`
- [ ] File committed to repo
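For reference, the metric definitions behind the blanks above. This assumes `evaluate.py`'s output can be reduced to parallel lists of `is_vulnerable` predictions and ground truths; the actual JSON schema is whatever your script writes:

```python
# Binary metrics over is_vulnerable predictions, as used on this checklist.
def binary_metrics(preds: list[bool], truths: list[bool]) -> dict[str, float]:
    tp = sum(p and t for p, t in zip(preds, truths))          # true positives
    tn = sum(not p and not t for p, t in zip(preds, truths))  # true negatives
    fp = sum(p and not t for p, t in zip(preds, truths))      # false positives
    fn = sum(not p and t for p, t in zip(preds, truths))      # false negatives
    return {
        "accuracy": (tp + tn) / len(preds),
        "false_positive_rate": fp / max(fp + tn, 1),  # FP over actual negatives
        "false_negative_rate": fn / max(fn + tp, 1),  # FN over actual positives
    }

m = binary_metrics([True, True, False, False], [True, False, False, True])
print(m)  # accuracy 0.5, FPR 0.5, FNR 0.5
```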
---
## PHASE 7 – Launch Real Training
### Pre-launch final checks
- [ ] All phases 0-6 are GREEN
- [ ] Budget approved by Niti (team lead)
- [ ] Config confirmed:
- [ ] `max_steps = 300`
- [ ] `save_steps = 50`
- [ ] `logging_steps = 1`
- [ ] `num_generations = 4`
- [ ] `learning_rate = 5e-6`
- [ ] `report_to = "wandb"`
- [ ] HF Space is still healthy (re-check `/health`)
- [ ] Screenshot this checklist with all boxes ticked → post in team channel
### Launch
```bash
# Option A: HF Jobs (preferred)
hf jobs uv run --flavor a10g-large train_grpo.py
# Option B: GCP (fallback)
nohup python train_grpo.py > training.log 2>&1 &
```
- [ ] Job started successfully
- [ ] Job ID / Dashboard URL captured: _______________________
- [ ] Wandb run URL captured: _______________________
- [ ] Posted both URLs in team channel
- [ ] Set alarm to check in 30 minutes
---
## PHASE 8 – During Training Monitoring
**Check every 30 minutes while awake. Check immediately on waking up.**
### Quick health check (< 2 min each time)
| Time | reward/mean | reward/std | loss | GPU mem | Status |
|------|-------------|------------|------|---------|--------|
| +30m | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| +1h | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| +1.5h | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| +2h | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| Final | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
### Red flags → immediate action
| Red flag | Action |
|---|---|
| reward/mean trending DOWN | Check env `/health`. If healthy, lower LR to 2e-6 and relaunch from latest checkpoint. |
| loss = NaN | Kill run. Add `max_grad_norm=1.0` to config. Relaunch from checkpoint. |
| GPU memory > 23GB | Will OOM soon. Kill run. Reduce `num_generations` to 2. Relaunch. |
| Env returning errors in Wandb logs | HF Space is sleeping. Hit `/health` to wake. If down, Niti restarts. |
| Steps/second dropped to 0 | Job hung. Kill and relaunch from checkpoint. |
| All rewards identical for 50+ steps | Reward function bug. Ping Deepak. |
---
## PHASE 9 – Post-Training
### Immediately after training completes
- [ ] Training finished without crash
- [ ] Wandb run status: "finished"
- [ ] Final reward/mean: _____ (higher than step-1 reward? That's the curve.)
- [ ] Screenshot reward curve from Wandb → save as `plots/reward_curve.png`
- [ ] Final checkpoint exists in output directory
- [ ] Total training time: _____ hours
- [ ] Total cost: $_____
### Save the model
```bash
# Push LoRA adapter to HF Hub
huggingface-cli upload inmodel-labs/commitguard-llama-3b \
./commitguard-llama-3b-grpo/final
```
- [ ] Upload successful
- [ ] Model page visible at https://huggingface.co/inmodel-labs/commitguard-llama-3b
### Verify the saved model loads
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = PeftModel.from_pretrained(base, "inmodel-labs/commitguard-llama-3b")
print("Trained model loads correctly")
```
- [ ] Model loads without error
- [ ] Quick inference produces XML-tagged output (not garbage)
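"XML-tagged output, not garbage" can be checked mechanically. A sketch using the `<action_type>` tag named in Phase 5A; adjust the pattern if your prompt template uses different tags:

```python
# Sanity check: does a generated string look like a CommitGuard verdict?
import re

def is_xml_tagged(output: str) -> bool:
    """True if the output contains a matched <action_type>...</action_type> pair."""
    return re.search(r"<action_type>.*?</action_type>", output, re.DOTALL) is not None

# Run one prompt through the reloaded model (Phase 9 code above), then:
print(is_xml_tagged("<action_type>verdict</action_type>"))  # True
print(is_xml_tagged("I think maybe the code is fine??"))    # False
```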
---
## PHASE 10 – Trained Model Eval
```bash
python evaluate.py \
--model_path ./commitguard-llama-3b-grpo/final \
--test_file data/devign_test.jsonl \
--is_lora \
--base_model meta-llama/Llama-3.2-3B-Instruct \
--output eval_trained.json
```
- [ ] Eval completes on all 100 test samples
- [ ] Binary accuracy: _____% (compare to baseline: _____%)
- [ ] CWE accuracy: _____% (compare to baseline: _____%)
- [ ] False positive rate: _____% (compare to baseline: _____%)
- [ ] False negative rate: _____% (compare to baseline: _____%)
- [ ] Results saved to `eval_trained.json`
- [ ] File committed to repo
### The verdict
- [ ] Trained accuracy > baseline accuracy? **YES / NO**
- [ ] If YES: by how many percentage points? _____pp
- [ ] If NO: check if qualitative outputs improved (reasoning traces better even if accuracy similar)
### Hand off to team
- [ ] Post in team channel:
```
TRAINING COMPLETE
Baseline accuracy: X%
Trained accuracy: Y%
Improvement: +Zpp
Wandb: [url]
Reward curve: [screenshot]
Model on Hub: inmodel-labs/commitguard-llama-3b
Ready for plots and README.
```
- [ ] Hand `eval_baseline.json` and `eval_trained.json` to Deepak for plot generation
- [ ] Kill GCP VM if running (`gcloud compute instances stop ...`)
- [ ] Update budget tracker in team channel
---
## PHASE 11 – Inference for Demo Video
**Divyank runs this to get the before/after examples for the demo recording.**
### Pick the demo sample
- [ ] Find ONE sample from test set where:
- Ground truth: vulnerable (preferably CWE-89 SQL injection)
- Baseline model gets it WRONG
- Trained model gets it RIGHT
- [ ] Sample commit_id: _______________________
### Generate baseline output
```python
# Load untrained model, generate response for the demo sample
# Save full text output to demo_baseline_output.txt
```
- [ ] Baseline output saved
- [ ] Output shows: wrong verdict / no reasoning / random guess
### Generate trained output
```python
# Load trained model, generate response for the demo sample
# Save full text output to demo_trained_output.txt
```
- [ ] Trained output saved
- [ ] Output shows: correct verdict / identifies CWE / sketches exploit
- [ ] The contrast between baseline and trained is VISIBLE and OBVIOUS
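Both generation steps above can share one sketch. `demo_prompt` is a hypothetical placeholder: build it exactly the way `train_grpo.py` formats prompts, or the before/after comparison is meaningless. Greedy decoding keeps the demo reproducible across takes:

```python
# Hedged sketch for the Phase 11 baseline/trained generations.
from pathlib import Path

def generate(model, tokenizer, prompt: str, max_new_tokens: int = 512) -> str:
    """Greedy (deterministic) generation; returns only the completion text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

def save_demo(path: str, text: str) -> str:
    """Write one model's full output for screen capture; returns the path."""
    Path(path).write_text(text)
    return path

# Usage (load each model as in Phases 4 and 9; demo_prompt is yours to build):
#   save_demo("demo_baseline_output.txt", generate(base_model, tok, demo_prompt))
#   save_demo("demo_trained_output.txt",  generate(lora_model, tok, demo_prompt))
```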
### Ready for recording
- [ ] Both outputs saved as text files for screen capture
- [ ] The diff for this sample is readable (not 80 lines of dense C)
- [ ] Proceed to demo video recording (see tasks_divyank.md)
---
## Emergency Fallback Reference Card
**Tape this next to your screen. Read it at 3 AM when your brain is mush.**
```
CRASHED? → Check Wandb → Is it OOM?
  YES OOM → num_generations=2, retry from checkpoint
  STILL OOM → Switch to Qwen2.5-1.5B, retry from scratch
  NOT OOM → Check error message → Screenshot → Post in team channel

REWARDS ALL ZERO? → Env bug, not model bug
  → curl /health on HF Space
  → If dead: ping Niti
  → If alive: curl /step manually, check reward value
  → If reward from curl is also 0: Deepak's reward function bug

LLAMA ACCESS DENIED? → Switch to Qwen2.5-1.5B immediately
  → Change ONE line: model_name="Qwen/Qwen2.5-1.5B-Instruct"
  → Everything else stays the same

CURVE IS FLAT? → Ship it anyway with honest narrative
  → "Training evidence shows optimization attempted;
     reward signal needs richer shaping in future work"
  → A flat curve + honest story > no submission
``` |