# HF Training Checklist — CommitGuard
**Print this. Tick every box in order. Do NOT skip steps.**
**If any box fails: STOP. Fix before proceeding.**
---
## PHASE 0 — Account Setup (Do Once, Do NOW)
- [ ] `huggingface-cli login` → authenticated
- [ ] `huggingface-cli whoami` → shows your username
- [ ] HF credits visible at https://huggingface.co/settings/billing → $30 showing
- [ ] Claim HF credits if not done: https://huggingface.co/coupons/claim/hf-openenv-community
- [ ] Llama-3.2-3B license accepted at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- [ ] License status: "You have been granted access" (NOT "pending")
- [ ] If pending after 30 min → **SWITCH TO Qwen2.5-1.5B-Instruct. No waiting.**
- [ ] `wandb login` → authenticated
- [ ] Wandb project created: `commitguard`
---
## PHASE 1 — Environment Health (Before ANY Training)
### 1A. HF Space is alive
```bash
curl https://<username>-commitguard.hf.space/health
```
- [ ] Returns `{"status": "healthy"}` with HTTP 200
- [ ] Response time < 3 seconds
### 1B. Env accepts actions
```bash
# Reset
curl -X POST https://<username>-commitguard.hf.space/reset
```
- [ ] Returns JSON with `diff` field (non-empty string)
- [ ] Returns JSON with `done: false`
- [ ] Returns JSON with `reward: 0.0`
```bash
# Step with verdict
curl -X POST https://<username>-commitguard.hf.space/step \
-H "Content-Type: application/json" \
-d '{"action_type":"verdict","is_vulnerable":true,"vuln_type":"CWE-89","exploit_sketch":"sql injection"}'
```
- [ ] Returns JSON with `reward` field (NOT 0.0 — should be +1.0 or -1.0)
- [ ] Returns JSON with `done: true`
### 1C. Env handles load
- [ ] Run 10 sequential reset→step cycles → zero crashes
- [ ] Run 5 concurrent reset→step cycles → zero crashes, no race conditions
- [ ] No request takes longer than 10 seconds
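The sequential and concurrent cycle checks above can be scripted. Below is a minimal harness sketch; it runs against a stub cycle so it is self-contained, whereas a real run would make each cycle POST to the Space's `/reset` and `/step` endpoints (e.g. with the `requests` library).

```python
import concurrent.futures
import time

def run_cycles(cycle, n, workers=1, timeout_s=10.0):
    """Run n reset->step cycles across `workers` threads; return per-cycle durations."""
    def timed():
        t0 = time.monotonic()
        cycle()  # one full reset -> step round trip
        return time.monotonic() - t0
    durations = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(timed) for _ in range(n)]
        for fut in concurrent.futures.as_completed(futures):
            dt = fut.result()  # re-raises any exception (i.e. a crash) from the cycle
            assert dt < timeout_s, f"cycle took {dt:.1f}s, over the {timeout_s}s budget"
            durations.append(dt)
    return durations

# Stub cycle so this sketch runs offline; swap in real HTTP calls for the actual check.
def stub_cycle():
    time.sleep(0.01)

seq = run_cycles(stub_cycle, n=10, workers=1)  # 10 sequential cycles
par = run_cycles(stub_cycle, n=5, workers=5)   # 5 concurrent cycles
print(len(seq), len(par))
```

Any crash inside a cycle propagates out of `fut.result()`, so a clean exit means "zero crashes" for both runs.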
### 1D. Reward sanity
- [ ] Correct vulnerable verdict → reward > 0 (expected: +1.0)
- [ ] False positive (safe code flagged) → reward < 0 (expected: -1.0)
- [ ] False negative (vuln missed) → reward < 0 (expected: -0.5)
- [ ] Rewards are NOT all identical across different samples
---
## PHASE 2 — Data Verification
- [ ] `data/devign_train.jsonl` exists
- [ ] `wc -l data/devign_train.jsonl` → >1000 samples
- [ ] `data/devign_test.jsonl` exists
- [ ] `wc -l data/devign_test.jsonl` → exactly 100 samples
- [ ] Train and test commit_ids are disjoint (no overlap)
- [ ] Spot check 3 samples: `code_after` is non-empty, `is_vulnerable` is boolean
- [ ] No sample exceeds 80 lines of code
- [ ] Approximate 50/50 split between vulnerable and safe samples
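The whole Phase 2 block can be run as one script. The sketch below assumes each JSONL row has `commit_id`, `code_after`, and `is_vulnerable` fields (per the spot-check items above); the demo at the bottom uses a tiny synthetic split with relaxed thresholds so it is self-contained.

```python
import json
import pathlib
import tempfile

def verify_split(train_path, test_path, min_train=1000, test_n=100, max_lines=80):
    """Run the Phase 2 checks; raises AssertionError on the first failure."""
    def load(path):
        rows = [json.loads(l) for l in pathlib.Path(path).read_text().splitlines() if l.strip()]
        for r in rows:
            assert isinstance(r["is_vulnerable"], bool), "is_vulnerable must be boolean"
            assert r["code_after"].strip(), "code_after must be non-empty"
            assert len(r["code_after"].splitlines()) <= max_lines, "sample exceeds line cap"
        return rows
    train, test = load(train_path), load(test_path)
    assert len(train) >= min_train, f"train has {len(train)} < {min_train} samples"
    assert len(test) == test_n, f"test has {len(test)} != {test_n} samples"
    assert not ({r["commit_id"] for r in train} & {r["commit_id"] for r in test}), \
        "train/test commit_ids overlap"
    vuln_frac = sum(r["is_vulnerable"] for r in train) / len(train)
    assert 0.4 <= vuln_frac <= 0.6, f"class balance off: {vuln_frac:.2f} vulnerable"
    return len(train), len(test)

# Demo on a tiny synthetic split (thresholds relaxed so the sketch runs anywhere).
tmp = pathlib.Path(tempfile.mkdtemp())
rows = [{"commit_id": f"c{i}", "code_after": "int f(void) { return 0; }",
         "is_vulnerable": i % 2 == 0} for i in range(4)]
(tmp / "train.jsonl").write_text("\n".join(json.dumps(r) for r in rows[:2]) + "\n")
(tmp / "test.jsonl").write_text("\n".join(json.dumps(r) for r in rows[2:]) + "\n")
counts = verify_split(tmp / "train.jsonl", tmp / "test.jsonl", min_train=2, test_n=2)
print(counts)
```

On the real data, call it with the defaults: `verify_split("data/devign_train.jsonl", "data/devign_test.jsonl")`.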
---
## PHASE 3 — GPU & Dependencies
### 3A. Hardware
```bash
nvidia-smi
```
- [ ] GPU visible with ≥16GB VRAM
- [ ] GPU name matches expected (T4 / A10G / L4)
- [ ] Free VRAM ≥ 14GB (kill other processes if needed)
### 3B. Python environment
```bash
python --version
```
- [ ] Python 3.10 or 3.11 (NOT 3.12 — Unsloth compatibility issues)
### 3C. Critical libraries
```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "from unsloth import FastLanguageModel; print('OK')"
python -c "from trl import GRPOTrainer; print('OK')"
python -c "from peft import PeftModel; print('OK')"
python -c "import wandb; print('OK')"
```
- [ ] torch ≥ 2.3.0, CUDA = True
- [ ] unsloth imports without error
- [ ] trl ≥ 0.12.0 imports without error
- [ ] peft imports without error
- [ ] wandb imports without error
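The version pins above can also be asserted in code. `version_ok` below is a hypothetical helper with a crude numeric comparison that ignores pre-release/local suffixes; for anything stricter, prefer `packaging.version.parse`.

```python
def version_tuple(v):
    """'2.4.0+cu121' -> (2, 4, 0); non-numeric segments become 0."""
    parts = []
    for p in v.split("+")[0].split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def version_ok(installed, required):
    """True if the installed version meets or exceeds the required one."""
    return version_tuple(installed) >= version_tuple(required)

# On the training box: version_ok(importlib.metadata.version("trl"), "0.12.0")
print(version_ok("2.4.0+cu121", "2.3.0"), version_ok("0.11.4", "0.12.0"))
```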
---
## PHASE 4 — Model Loading Test
```python
import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
print("Model loaded successfully")
print(f"GPU memory: {torch.cuda.memory_allocated()/1e9:.1f}GB")
```
- [ ] Model loads without OOM
- [ ] GPU memory after load < 6GB (leaves room for GRPO overhead)
- [ ] No warnings about missing tokenizer files
### LoRA application
```python
model = FastLanguageModel.get_peft_model(
    model, r=8, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
```
- [ ] LoRA applies without error
- [ ] Trainable params ~3-8M (NOT the full 3B)
---
## PHASE 5 — Dry Run (2 Steps)
**THE MOST CRITICAL CHECK. DO NOT SKIP.**
```bash
python train_grpo.py --max_steps 2
```
### 5A. Generation
- [ ] First prompt formatted correctly (print it — does it contain a code diff?)
- [ ] 4 completions generated for first prompt
- [ ] At least 2 of 4 completions contain `<action_type>` XML tags
- [ ] Completions are different from each other (not all identical)
### 5B. Reward collection
- [ ] All 4 completions submitted to env
- [ ] All 4 rewards received (no timeouts)
- [ ] Rewards have variance (NOT all the same value)
- [ ] Rewards in expected range [-1.0, +2.0]
- [ ] Print rewards: `[_____, _____, _____, _____]` (write them down)
### 5C. Training step
- [ ] GRPO loss computed (finite number, not NaN, not inf, not 0.0)
- [ ] Loss value: _____ (write it down)
- [ ] Wandb shows run with 2 logged steps
- [ ] No OOM during backward pass
- [ ] Peak GPU memory: _____GB (must be < 22GB on A10G or < 14GB on T4)
### 5D. Checkpointing
- [ ] Output directory created: `./commitguard-llama-3b-grpo/`
- [ ] Checkpoint files present (or will be at step 50)
### 5E. Timing estimate
- [ ] 2 steps took _____ seconds
- [ ] Estimated time for 300 steps: _____ minutes (= 2-step-time × 150)
- [ ] Estimated cost: _____ dollars (hours Γ— GPU hourly rate)
- [ ] Cost within budget? (must be under $8)
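The extrapolation above is simple enough to script so nobody does 3 AM arithmetic by hand. The GPU rate below is an assumed example value; substitute the actual hourly rate for your flavor.

```python
def estimate_run(two_step_seconds, total_steps=300, gpu_rate_per_hour=1.50, budget=8.0):
    """Extrapolate a 2-step dry run to the full run; rate and budget are in dollars."""
    seconds = two_step_seconds / 2 * total_steps
    cost = seconds / 3600 * gpu_rate_per_hour
    return {
        "minutes": round(seconds / 60, 1),
        "cost": round(cost, 2),
        "within_budget": cost <= budget,
    }

est = estimate_run(48.0)  # e.g. 2 steps took 48 s -> 120 min total
print(est)
```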
---
## PHASE 6 — Baseline Eval (Before Training)
**MUST run baseline BEFORE training. Cannot run after — you need the contrast.**
```bash
python evaluate.py \
--model_path meta-llama/Llama-3.2-3B-Instruct \
--test_file data/devign_test.jsonl \
--output eval_baseline.json
```
- [ ] Eval completes on all 100 test samples
- [ ] Binary accuracy: _____% (write it down, expected: 30-50%)
- [ ] CWE accuracy: _____% (expected: low, maybe 5-15%)
- [ ] False positive rate: _____%
- [ ] False negative rate: _____%
- [ ] Results saved to `eval_baseline.json`
- [ ] File committed to repo
---
## PHASE 7 — Launch Real Training
### Pre-launch final checks
- [ ] All phases 0-6 are GREEN
- [ ] Budget approved by Niti (team lead)
- [ ] Config confirmed:
- [ ] `max_steps = 300`
- [ ] `save_steps = 50`
- [ ] `logging_steps = 1`
- [ ] `num_generations = 4`
- [ ] `learning_rate = 5e-6`
- [ ] `report_to = "wandb"`
- [ ] HF Space is still healthy (re-check `/health`)
- [ ] Screenshot this checklist with all boxes ticked → post in team channel
### Launch
```bash
# Option A: HF Jobs (preferred)
hf jobs uv run --flavor a10g-large train_grpo.py
# Option B: GCP (fallback)
nohup python train_grpo.py > training.log 2>&1 &
```
- [ ] Job started successfully
- [ ] Job ID / Dashboard URL captured: _______________________
- [ ] Wandb run URL captured: _______________________
- [ ] Posted both URLs in team channel
- [ ] Set alarm to check in 30 minutes
---
## PHASE 8 — During Training Monitoring
**Check every 30 minutes while awake. Check immediately on waking up.**
### Quick health check (< 2 min each time)
| Time | reward/mean | reward/std | loss | GPU mem | Status |
|------|-------------|------------|------|---------|--------|
| +30m | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| +1h | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| +1.5h | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| +2h | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
| Final | _____ | _____ | _____ | _____ | ✅/⚠️/❌ |
### Red flags → immediate action
| Red flag | Action |
|---|---|
| reward/mean trending DOWN | Check env `/health`. If healthy, lower LR to 2e-6 and relaunch from latest checkpoint. |
| loss = NaN | Kill run. Add `max_grad_norm=1.0` to config. Relaunch from checkpoint. |
| GPU memory > 23GB | Will OOM soon. Kill run. Reduce `num_generations` to 2. Relaunch. |
| Env returning errors in Wandb logs | HF Space is sleeping. Hit `/health` to wake. If down, Niti restarts. |
| Steps/second dropped to 0 | Job hung. Kill and relaunch from checkpoint. |
| All rewards identical for 50+ steps | Reward function bug. Ping Deepak. |
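The table's numeric triggers can be folded into one helper for the 30-minute check. This is a sketch with hypothetical names; thresholds match the table, and the reward-trend check (last vs. first of the recent window) is a crude proxy for "trending DOWN".

```python
import math
import statistics

def red_flags(losses, rewards, gpu_mem_gb, steps_per_sec):
    """Map recent training signals to the red-flag table; returns a list of flag names."""
    flags = []
    if any(math.isnan(l) or math.isinf(l) for l in losses):
        flags.append("nan_loss")            # -> kill, add max_grad_norm=1.0, relaunch
    if gpu_mem_gb > 23:
        flags.append("gpu_mem_high")        # -> will OOM soon; num_generations=2
    if steps_per_sec == 0:
        flags.append("hung")                # -> kill and relaunch from checkpoint
    if len(rewards) >= 50 and statistics.pstdev(rewards[-50:]) == 0:
        flags.append("rewards_identical")   # -> reward function bug; ping Deepak
    if len(rewards) >= 2 and rewards[-1] < rewards[0]:
        flags.append("reward_trending_down")  # crude proxy; eyeball the Wandb curve too
    return flags

flags = red_flags([0.4, float("nan")], [0.5] * 50, gpu_mem_gb=21, steps_per_sec=0.8)
print(flags)
```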
---
## PHASE 9 — Post-Training
### Immediately after training completes
- [ ] Training finished without crash
- [ ] Wandb run status: "finished"
- [ ] Final reward/mean: _____ (higher than step-1 reward? That's the curve.)
- [ ] Screenshot reward curve from Wandb → save as `plots/reward_curve.png`
- [ ] Final checkpoint exists in output directory
- [ ] Total training time: _____ hours
- [ ] Total cost: $_____
### Save the model
```bash
# Push LoRA adapter to HF Hub
huggingface-cli upload inmodel-labs/commitguard-llama-3b \
./commitguard-llama-3b-grpo/final
```
- [ ] Upload successful
- [ ] Model page visible at https://huggingface.co/inmodel-labs/commitguard-llama-3b
### Verify the saved model loads
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = PeftModel.from_pretrained(base, "inmodel-labs/commitguard-llama-3b")
print("Trained model loads correctly")
```
- [ ] Model loads without error
- [ ] Quick inference produces XML-tagged output (not garbage)
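"XML-tagged output (not garbage)" can be checked mechanically with a regex over the generated text. The tag schema below is an assumption based on the `<action_type>` checks in Phases 1B and 5A; adjust the pattern if the env's schema differs.

```python
import re

# Assumed verdict schema: completions should contain an <action_type>verdict</action_type> pair.
VERDICT_RE = re.compile(r"<action_type>\s*verdict\s*</action_type>", re.IGNORECASE)

def looks_like_verdict(text):
    """True if a completion contains the expected action_type tag pair."""
    return bool(VERDICT_RE.search(text))

good = "<action_type>verdict</action_type><is_vulnerable>true</is_vulnerable>"
bad = "I think this code might be unsafe, but I'm not sure."
print(looks_like_verdict(good), looks_like_verdict(bad))
```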
---
## PHASE 10 — Trained Model Eval
```bash
python evaluate.py \
--model_path ./commitguard-llama-3b-grpo/final \
--test_file data/devign_test.jsonl \
--is_lora \
--base_model meta-llama/Llama-3.2-3B-Instruct \
--output eval_trained.json
```
- [ ] Eval completes on all 100 test samples
- [ ] Binary accuracy: _____% (compare to baseline: _____%)
- [ ] CWE accuracy: _____% (compare to baseline: _____%)
- [ ] False positive rate: _____% (compare to baseline: _____%)
- [ ] False negative rate: _____% (compare to baseline: _____%)
- [ ] Results saved to `eval_trained.json`
- [ ] File committed to repo
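The baseline-vs-trained comparison above can be computed from the two JSON files in one go. The metric field names here are assumptions about `evaluate.py`'s output schema; rename them to match the real files.

```python
import json

METRICS = ("binary_accuracy", "cwe_accuracy", "false_positive_rate", "false_negative_rate")

def compare_evals(baseline, trained, keys=METRICS):
    """Return percentage-point deltas (trained - baseline) for each metric."""
    return {k: round(trained[k] - baseline[k], 1) for k in keys}

# On the real run: baseline = json.load(open("eval_baseline.json")), same for trained.
baseline = {"binary_accuracy": 42.0, "cwe_accuracy": 9.0,
            "false_positive_rate": 30.0, "false_negative_rate": 28.0}
trained = {"binary_accuracy": 61.0, "cwe_accuracy": 24.0,
           "false_positive_rate": 18.0, "false_negative_rate": 21.0}
delta = compare_evals(baseline, trained)
print(delta)  # positive accuracy deltas and negative error-rate deltas = improvement
```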
### The verdict
- [ ] Trained accuracy > baseline accuracy? **YES / NO**
- [ ] If YES: by how many percentage points? _____pp
- [ ] If NO: check if qualitative outputs improved (reasoning traces better even if accuracy similar)
### Hand off to team
- [ ] Post in team channel:
```
TRAINING COMPLETE
Baseline accuracy: X%
Trained accuracy: Y%
Improvement: +Zpp
Wandb: [url]
Reward curve: [screenshot]
Model on Hub: inmodel-labs/commitguard-llama-3b
Ready for plots and README.
```
- [ ] Hand `eval_baseline.json` and `eval_trained.json` to Deepak for plot generation
- [ ] Kill GCP VM if running (`gcloud compute instances stop ...`)
- [ ] Update budget tracker in team channel
---
## PHASE 11 — Inference for Demo Video
**Divyank runs this to get the before/after examples for the demo recording.**
### Pick the demo sample
- [ ] Find ONE sample from test set where:
- Ground truth: vulnerable (preferably CWE-89 SQL injection)
- Baseline model gets it WRONG
- Trained model gets it RIGHT
- [ ] Sample commit_id: _______________________
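Hunting for the demo sample by hand is slow; if the eval runs save per-sample predictions, the pick can be automated. The per-sample dict fields below (`commit_id`, `label_vulnerable`, `pred_vulnerable`, `cwe`) are an assumed schema for `evaluate.py` output, not a confirmed one.

```python
def pick_demo_sample(baseline_preds, trained_preds, want_cwe="CWE-89"):
    """Return a commit_id that is vulnerable, missed by baseline, caught by trained.
    Prefers samples the trained model tags with want_cwe; None if nothing qualifies."""
    trained_by_id = {p["commit_id"]: p for p in trained_preds}
    candidates = [
        b["commit_id"] for b in baseline_preds
        if b["label_vulnerable"]                             # ground truth: vulnerable
        and not b["pred_vulnerable"]                         # baseline got it WRONG
        and trained_by_id[b["commit_id"]]["pred_vulnerable"] # trained got it RIGHT
    ]
    preferred = [c for c in candidates if trained_by_id[c].get("cwe") == want_cwe]
    return (preferred or candidates or [None])[0]

# Tiny synthetic predictions to exercise the picker.
baseline_preds = [
    {"commit_id": "a1", "label_vulnerable": True, "pred_vulnerable": False},
    {"commit_id": "b2", "label_vulnerable": True, "pred_vulnerable": True},
]
trained_preds = [
    {"commit_id": "a1", "pred_vulnerable": True, "cwe": "CWE-89"},
    {"commit_id": "b2", "pred_vulnerable": True, "cwe": "CWE-476"},
]
demo_id = pick_demo_sample(baseline_preds, trained_preds)
print(demo_id)
```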
### Generate baseline output
```python
# Load untrained model, generate response for the demo sample
# Save full text output to demo_baseline_output.txt
```
- [ ] Baseline output saved
- [ ] Output shows: wrong verdict / no reasoning / random guess
### Generate trained output
```python
# Load trained model, generate response for the demo sample
# Save full text output to demo_trained_output.txt
```
- [ ] Trained output saved
- [ ] Output shows: correct verdict / identifies CWE / sketches exploit
- [ ] The contrast between baseline and trained is VISIBLE and OBVIOUS
### Ready for recording
- [ ] Both outputs saved as text files for screen capture
- [ ] The diff for this sample is readable (not 80 lines of dense C)
- [ ] Proceed to demo video recording (see tasks_divyank.md)
---
## Emergency Fallback Reference Card
**Tape this next to your screen. Read it at 3 AM when your brain is mush.**
```
CRASHED? → Check Wandb → Is it OOM?
  YES OOM   → num_generations=2, retry from checkpoint
  STILL OOM → Switch to Qwen2.5-1.5B, retry from scratch
  NOT OOM   → Check error message → Screenshot → Post in team channel

REWARDS ALL ZERO? → Env bug, not model bug
  → curl /health on HF Space
  → If dead: ping Niti
  → If alive: curl /step manually, check reward value
  → If reward from curl is also 0: Deepak's reward function bug

LLAMA ACCESS DENIED? → Switch to Qwen2.5-1.5B immediately
  → Change ONE line: model_name="Qwen/Qwen2.5-1.5B-Instruct"
  → Everything else stays the same

CURVE IS FLAT? → Ship it anyway with honest narrative
  → "Training evidence shows optimization attempted;
     reward signal needs richer shaping in future work"
  → A flat curve + honest story > no submission
```