# HF Training Checklist – CommitGuard

**Print this. Tick every box in order. Do NOT skip steps.**
**If any box fails: STOP. Fix before proceeding.**

---

## PHASE 0 – Account Setup (Do Once, Do NOW)

- [ ] `huggingface-cli login` → authenticated
- [ ] `huggingface-cli whoami` → shows your username
- [ ] HF credits visible at https://huggingface.co/settings/billing → $30 showing
- [ ] Claim HF credits if not done: https://huggingface.co/coupons/claim/hf-openenv-community
- [ ] Llama-3.2-3B license accepted at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- [ ] License status: "You have been granted access" (NOT "pending")
- [ ] If pending after 30 min → **SWITCH TO Qwen2.5-1.5B-Instruct. No waiting.**
- [ ] `wandb login` → authenticated
- [ ] Wandb project created: `commitguard`

---

## PHASE 1 – Environment Health (Before ANY Training)

### 1A. HF Space is alive

```bash
curl https://<username>-commitguard.hf.space/health
```

- [ ] Returns `{"status": "healthy"}` with HTTP 200
- [ ] Response time < 3 seconds

### 1B. Env accepts actions

```bash
# Reset
curl -X POST https://<username>-commitguard.hf.space/reset
```

- [ ] Returns JSON with `diff` field (non-empty string)
- [ ] Returns JSON with `done: false`
- [ ] Returns JSON with `reward: 0.0`

```bash
# Step with verdict
curl -X POST https://<username>-commitguard.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"verdict","is_vulnerable":true,"vuln_type":"CWE-89","exploit_sketch":"sql injection"}'
```

- [ ] Returns JSON with `reward` field (NOT 0.0; should be +1.0 or -1.0)
- [ ] Returns JSON with `done: true`

### 1C. Env handles load

- [ ] Run 10 sequential reset→step cycles → zero crashes
- [ ] Run 5 concurrent reset→step cycles → zero crashes, no race conditions
- [ ] No request takes longer than 10 seconds
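The 1C load checks can be scripted instead of run by hand. A minimal sketch: `run_cycle`, `load_test`, and the `post` callable are names invented here, and the `/reset` and `/step` payloads are assumed to match the 1B examples above. Pass in any thin wrapper around your HTTP client that takes a path and a JSON payload and returns the decoded response.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def run_cycle(post):
    """One reset -> step cycle. `post` is any (path, payload) -> dict callable,
    e.g. a wrapper around requests.post against your Space URL (hypothetical name)."""
    start = time.time()
    obs = post("/reset", None)
    assert obs.get("done") is False, "reset should return done=false"
    result = post("/step", {
        "action_type": "verdict",
        "is_vulnerable": True,
        "vuln_type": "CWE-89",
        "exploit_sketch": "sql injection",
    })
    assert result.get("done") is True, "a verdict should end the episode"
    return time.time() - start

def load_test(post, sequential=10, parallel=5, max_seconds=10.0):
    """Runs the 1C checks: 10 sequential cycles, then 5 concurrent ones,
    and verifies no single cycle exceeds the 10-second cap."""
    times = [run_cycle(post) for _ in range(sequential)]
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        times += list(pool.map(lambda _: run_cycle(post), range(parallel)))
    assert max(times) < max_seconds, f"slowest cycle took {max(times):.1f}s"
    return times
```

Any crash or assertion failure here means the env is not ready; fix it before Phase 2.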

### 1D. Reward sanity

- [ ] Correct vulnerable verdict → reward > 0 (expected: +1.0)
- [ ] False positive (safe code flagged) → reward < 0 (expected: -1.0)
- [ ] False negative (vuln missed) → reward < 0 (expected: -0.5)
- [ ] Rewards are NOT all identical across different samples

---

## PHASE 2 – Data Verification

- [ ] `data/devign_train.jsonl` exists
- [ ] `wc -l data/devign_train.jsonl` → more than 1,000 samples
- [ ] `data/devign_test.jsonl` exists
- [ ] `wc -l data/devign_test.jsonl` → exactly 100 samples
- [ ] Train and test commit_ids are disjoint (no overlap)
- [ ] Spot check 3 samples: `code_after` is non-empty, `is_vulnerable` is boolean
- [ ] No sample exceeds 80 lines of code
- [ ] Approximate 50/50 split between vulnerable and safe samples

---

## PHASE 3 – GPU & Dependencies

### 3A. Hardware

```bash
nvidia-smi
```

- [ ] GPU visible with ≥16GB VRAM
- [ ] GPU name matches expected (T4 / A10G / L4)
- [ ] Free VRAM ≥ 14GB (kill other processes if needed)

### 3B. Python environment

```bash
python --version
```

- [ ] Python 3.10 or 3.11 (NOT 3.12; Unsloth compatibility issues)

### 3C. Critical libraries

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "from unsloth import FastLanguageModel; print('OK')"
python -c "from trl import GRPOTrainer; print('OK')"
python -c "from peft import PeftModel; print('OK')"
python -c "import wandb; print('OK')"
```

- [ ] torch ≥ 2.3.0, CUDA = True
- [ ] unsloth imports without error
- [ ] trl ≥ 0.12.0 imports without error
- [ ] peft imports without error
- [ ] wandb imports without error

---

## PHASE 4 – Model Loading Test

```python
import torch  # needed for the memory readout below
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
print("Model loaded successfully")
print(f"GPU memory: {torch.cuda.memory_allocated()/1e9:.1f}GB")
```

- [ ] Model loads without OOM
- [ ] GPU memory after load < 6GB (leaves room for GRPO overhead)
- [ ] No warnings about missing tokenizer files

### LoRA application

```python
model = FastLanguageModel.get_peft_model(
    model, r=8, lora_alpha=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
)
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
```

- [ ] LoRA applies without error
- [ ] Trainable params ~3-8M (NOT the full 3B)

---

## PHASE 5 – Dry Run (2 Steps)

**THE MOST CRITICAL CHECK. DO NOT SKIP.**

```bash
python train_grpo.py --max_steps 2
```

### 5A. Generation

- [ ] First prompt formatted correctly (print it and confirm it contains a code diff)
- [ ] 4 completions generated for first prompt
- [ ] At least 2 of 4 completions contain `<action_type>` XML tags
- [ ] Completions are different from each other (not all identical)

### 5B. Reward collection

- [ ] All 4 completions submitted to env
- [ ] All 4 rewards received (no timeouts)
- [ ] Rewards have variance (NOT all the same value)
- [ ] Rewards in expected range [-1.0, +2.0]
- [ ] Print rewards: `[_____, _____, _____, _____]` (write them down)
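The variance and range boxes above are easy to fumble at 3 AM, so they can be asserted instead of eyeballed. A small sketch; `check_rewards` is a name invented here, and the [-1.0, +2.0] range is taken from the checklist above.

```python
def check_rewards(rewards, lo=-1.0, hi=2.0):
    """Dry-run check 5B: rewards must vary across completions and stay
    inside the env's declared range."""
    assert len(set(rewards)) > 1, "rewards all identical; reward function bug?"
    assert all(lo <= r <= hi for r in rewards), \
        f"reward outside [{lo}, {hi}]: {rewards}"
```

Call it on the four printed rewards before moving on to 5C.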

### 5C. Training step

- [ ] GRPO loss computed (finite number, not NaN, not inf, not 0.0)
- [ ] Loss value: _____ (write it down)
- [ ] Wandb shows run with 2 logged steps
- [ ] No OOM during backward pass
- [ ] Peak GPU memory: _____GB (must be < 22GB on A10G or < 14GB on T4)

### 5D. Checkpointing

- [ ] Output directory created: `./commitguard-llama-3b-grpo/`
- [ ] Checkpoint files present (or will be at step 50)

### 5E. Timing estimate

- [ ] 2 steps took _____ seconds
- [ ] Estimated time for 300 steps: _____ minutes (= 2-step time × 150)
- [ ] Estimated cost: _____ dollars (hours Γ— GPU hourly rate)
- [ ] Cost within budget? (must be under $8)
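The timing and cost arithmetic above can be done in one helper so nobody multiplies by the wrong factor. A sketch; `estimate` is a name invented here, and the default hourly rate is an assumed A10G price, so substitute your provider's actual figure.

```python
def estimate(two_step_seconds, total_steps=300, gpu_rate_per_hour=1.05):
    """Scales dry-run timing to the full run.
    gpu_rate_per_hour is an ASSUMED rate; replace with your real GPU price.
    Returns (minutes, dollars)."""
    hours = (two_step_seconds / 2) * total_steps / 3600
    return hours * 60, hours * gpu_rate_per_hour
```

For example, if the 2-step dry run took 24 seconds, the full 300 steps come out to about an hour.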

---

## PHASE 6 – Baseline Eval (Before Training)

**MUST run baseline BEFORE training. Do not skip it; the final report needs the before/after contrast.**

```bash
python evaluate.py \
  --model_path meta-llama/Llama-3.2-3B-Instruct \
  --test_file data/devign_test.jsonl \
  --output eval_baseline.json
```

- [ ] Eval completes on all 100 test samples
- [ ] Binary accuracy: _____% (write it down, expected: 30-50%)
- [ ] CWE accuracy: _____% (expected: low, maybe 5-15%)
- [ ] False positive rate: _____%
- [ ] False negative rate: _____%
- [ ] Results saved to `eval_baseline.json`
- [ ] File committed to repo

---

## PHASE 7 – Launch Real Training

### Pre-launch final checks

- [ ] All phases 0-6 are GREEN
- [ ] Budget approved by Niti (team lead)
- [ ] Config confirmed:
  - [ ] `max_steps = 300`
  - [ ] `save_steps = 50`
  - [ ] `logging_steps = 1`
  - [ ] `num_generations = 4`
  - [ ] `learning_rate = 5e-6`
  - [ ] `report_to = "wandb"`
- [ ] HF Space is still healthy (re-check `/health`)
- [ ] Screenshot this checklist with all boxes ticked → post in team channel

### Launch

```bash
# Option A: HF Jobs (preferred)
hf jobs uv run --flavor a10g-large train_grpo.py

# Option B: GCP (fallback)
nohup python train_grpo.py > training.log 2>&1 &
```

- [ ] Job started successfully
- [ ] Job ID / Dashboard URL captured: _______________________
- [ ] Wandb run URL captured: _______________________
- [ ] Posted both URLs in team channel
- [ ] Set alarm to check in 30 minutes

---

## PHASE 8 – During Training Monitoring

**Check every 30 minutes while awake. Check immediately on waking up.**

### Quick health check (< 2 min each time)

| Time | reward/mean | reward/std | loss | GPU mem | Status |
|------|-------------|------------|------|---------|--------|
| +30m | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| +1h  | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| +1.5h | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| +2h  | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| Final | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |

### Red flags → immediate action

| Red flag | Action |
|---|---|
| reward/mean trending DOWN | Check env `/health`. If healthy, lower LR to 2e-6 and relaunch from latest checkpoint. |
| loss = NaN | Kill run. Add `max_grad_norm=1.0` to config. Relaunch from checkpoint. |
| GPU memory > 23GB | Will OOM soon. Kill run. Reduce `num_generations` to 2. Relaunch. |
| Env returning errors in Wandb logs | HF Space is sleeping. Hit `/health` to wake. If down, Niti restarts. |
| Steps/second dropped to 0 | Job hung. Kill and relaunch from checkpoint. |
| All rewards identical for 50+ steps | Reward function bug. Ping Deepak. |

---

## PHASE 9 – Post-Training

### Immediately after training completes

- [ ] Training finished without crash
- [ ] Wandb run status: "finished"
- [ ] Final reward/mean: _____ (higher than step-1 reward? That's the curve.)
- [ ] Screenshot reward curve from Wandb → save as `plots/reward_curve.png`
- [ ] Final checkpoint exists in output directory
- [ ] Total training time: _____ hours
- [ ] Total cost: $_____

### Save the model

```bash
# Push LoRA adapter to HF Hub
huggingface-cli upload inmodel-labs/commitguard-llama-3b \
  ./commitguard-llama-3b-grpo/final
```

- [ ] Upload successful
- [ ] Model page visible at https://huggingface.co/inmodel-labs/commitguard-llama-3b

### Verify the saved model loads

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = PeftModel.from_pretrained(base, "inmodel-labs/commitguard-llama-3b")
print("Trained model loads correctly")
```

- [ ] Model loads without error
- [ ] Quick inference produces XML-tagged output (not garbage)
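For the "XML-tagged output" box, a small checker beats squinting at raw generations. A sketch; `looks_like_verdict` is a name invented here, and the `<action_type>verdict</action_type>` tag shape is assumed from the action format the env accepts in Phase 1B.

```python
import re

def looks_like_verdict(text):
    """True if the generation contains the XML verdict action tag,
    e.g. <action_type>verdict</action_type>. Adjust the pattern if your
    env's action schema uses different tag names."""
    return bool(re.search(r"<action_type>\s*verdict\s*</action_type>", text))
```

Run one generation through the loaded PeftModel and pass the decoded text to this function.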

---

## PHASE 10 – Trained Model Eval

```bash
python evaluate.py \
  --model_path ./commitguard-llama-3b-grpo/final \
  --test_file data/devign_test.jsonl \
  --is_lora \
  --base_model meta-llama/Llama-3.2-3B-Instruct \
  --output eval_trained.json
```

- [ ] Eval completes on all 100 test samples
- [ ] Binary accuracy: _____% (compare to baseline: _____%)
- [ ] CWE accuracy: _____% (compare to baseline: _____%)
- [ ] False positive rate: _____% (compare to baseline: _____%)
- [ ] False negative rate: _____% (compare to baseline: _____%)
- [ ] Results saved to `eval_trained.json`
- [ ] File committed to repo

### The verdict

- [ ] Trained accuracy > baseline accuracy? **YES / NO**
- [ ] If YES: by how many percentage points? _____pp
- [ ] If NO: check if qualitative outputs improved (reasoning traces better even if accuracy similar)

### Hand off to team

- [ ] Post in team channel:
  ```
  TRAINING COMPLETE
  Baseline accuracy: X%
  Trained accuracy: Y%
  Improvement: +Zpp
  Wandb: [url]
  Reward curve: [screenshot]
  Model on Hub: inmodel-labs/commitguard-llama-3b
  Ready for plots and README.
  ```
- [ ] Hand `eval_baseline.json` and `eval_trained.json` to Deepak for plot generation
- [ ] Kill GCP VM if running (`gcloud compute instances stop ...`)
- [ ] Update budget tracker in team channel

---

## PHASE 11 – Inference for Demo Video

**Divyank runs this to get the before/after examples for the demo recording.**

### Pick the demo sample

- [ ] Find ONE sample from test set where:
  - Ground truth: vulnerable (preferably CWE-89 SQL injection)
  - Baseline model gets it WRONG
  - Trained model gets it RIGHT
- [ ] Sample commit_id: _______________________

### Generate baseline output

```python
# Load untrained model, generate response for the demo sample
# Save full text output to demo_baseline_output.txt
```

- [ ] Baseline output saved
- [ ] Output shows: wrong verdict / no reasoning / random guess

### Generate trained output

```python
# Load trained model, generate response for the demo sample
# Save full text output to demo_trained_output.txt
```

- [ ] Trained output saved
- [ ] Output shows: correct verdict / identifies CWE / sketches exploit
- [ ] The contrast between baseline and trained is VISIBLE and OBVIOUS

### Ready for recording

- [ ] Both outputs saved as text files for screen capture
- [ ] The diff for this sample is readable (not 80 lines of dense C)
- [ ] Proceed to demo video recording (see tasks_divyank.md)

---

## Emergency Fallback Reference Card

**Tape this next to your screen. Read it at 3 AM when your brain is mush.**

```
CRASHED? → Check Wandb → Is it OOM?
  YES OOM → num_generations=2, retry from checkpoint
  STILL OOM → Switch to Qwen2.5-1.5B, retry from scratch
  NOT OOM → Check error message → Screenshot → Post in team channel

REWARDS ALL ZERO? → Env bug, not model bug
  → curl /health on HF Space
  → If dead: ping Niti
  → If alive: curl /step manually, check reward value
  → If reward from curl is also 0: Deepak's reward function bug

LLAMA ACCESS DENIED? → Switch to Qwen2.5-1.5B immediately
  → Change ONE line: model_name="Qwen/Qwen2.5-1.5B-Instruct"
  → Everything else stays the same

CURVE IS FLAT? → Ship it anyway with honest narrative
  → "Training evidence shows optimization attempted;
     reward signal needs richer shaping in future work"
  → A flat curve + honest story > no submission
```