
HF Training Checklist — CommitGuard

Print this. Tick every box in order. Do NOT skip steps. If any box fails: STOP. Fix before proceeding.


PHASE 0 — Account Setup (Do Once, Do NOW)


PHASE 1 — Environment Health (Before ANY Training)

1A. HF Space is alive

curl https://<username>-commitguard.hf.space/health
  • Returns {"status": "healthy"} with HTTP 200
  • Response time < 3 seconds

1B. Env accepts actions

# Reset
curl -X POST https://<username>-commitguard.hf.space/reset
  • Returns JSON with diff field (non-empty string)
  • Returns JSON with done: false
  • Returns JSON with reward: 0.0
# Step with verdict
curl -X POST https://<username>-commitguard.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"verdict","is_vulnerable":true,"vuln_type":"CWE-89","exploit_sketch":"sql injection"}'
  • Returns JSON with reward field (NOT 0.0 — should be +1.0 or -1.0)
  • Returns JSON with done: true
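To tick the 1B boxes mechanically rather than by eyeballing curl output, a small validator helps. This is a sketch, not part of the repo: the field names (`diff`, `done`, `reward`) come from the checklist above, and the functions take already-parsed JSON dicts so you can pipe curl output through them.

```python
def check_reset_response(obs: dict) -> list[str]:
    """Return a list of problems with a /reset response (empty list = healthy)."""
    problems = []
    # diff must be a non-empty string
    if not isinstance(obs.get("diff"), str) or not obs["diff"].strip():
        problems.append("diff missing or empty")
    if obs.get("done") is not False:
        problems.append("done should be false after reset")
    if obs.get("reward") != 0.0:
        problems.append("reward should be 0.0 after reset")
    return problems

def check_step_response(obs: dict) -> list[str]:
    """Return a list of problems with a /step (verdict) response."""
    problems = []
    reward = obs.get("reward")
    if not isinstance(reward, (int, float)):
        problems.append("reward missing")
    elif reward == 0.0:
        problems.append("reward is 0.0; a verdict should score +1.0 or -1.0")
    if obs.get("done") is not True:
        problems.append("done should be true after a verdict")
    return problems
```

Feed both functions the parsed bodies from the curl commands above; an empty list means the box can be ticked.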

1C. Env handles load

  • Run 10 sequential reset→step cycles → zero crashes
  • Run 5 concurrent reset→step cycles → zero crashes, no race conditions
  • No request takes longer than 10 seconds
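A minimal load-check harness, assuming nothing about the env beyond the checklist: `cycle` is any zero-arg callable you supply that performs one reset→step round trip (e.g. two HTTP POSTs against the Space) and raises on failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_check(cycle, sequential=10, concurrent=5, max_seconds=10.0):
    """Run reset->step cycles and count failures and over-limit calls."""
    failures, slow = 0, 0

    def timed():
        nonlocal failures, slow
        start = time.monotonic()
        try:
            cycle()
        except Exception:
            failures += 1
            return
        if time.monotonic() - start > max_seconds:
            slow += 1

    # sequential pass: 10 cycles back to back
    for _ in range(sequential):
        timed()
    # concurrent pass: 5 cycles in flight at once
    with ThreadPoolExecutor(max_workers=concurrent) as pool:
        for _ in range(concurrent):
            pool.submit(timed)
    return {"failures": failures, "slow": slow}
```

Both counters must come back 0 before ticking the 1C boxes.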

1D. Reward sanity

  • Correct vulnerable verdict β†’ reward > 0 (expected: +1.0)
  • False positive (safe code flagged) β†’ reward < 0 (expected: -1.0)
  • False negative (vuln missed) β†’ reward < 0 (expected: -0.5)
  • Rewards are NOT all identical across different samples
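The expected rewards above can be encoded as a lookup so the sanity check is one function call per sample. A sketch only; the checklist pins down three of the four outcomes, so the true-negative value is deliberately left unchecked.

```python
# Expected rewards from the checklist's scoring scheme,
# keyed by (ground_truth_vulnerable, model_said_vulnerable).
EXPECTED = {
    (True, True): 1.0,    # correct vulnerable verdict
    (False, True): -1.0,  # false positive
    (True, False): -0.5,  # false negative
}

def reward_sane(ground_truth: bool, verdict: bool, reward: float) -> bool:
    """Check one observed reward against the expected value for that outcome."""
    expected = EXPECTED.get((ground_truth, verdict))
    if expected is None:
        return True  # true negative: value not specified by the checklist
    return reward == expected
```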

PHASE 2 β€” Data Verification

  • data/devign_train.jsonl exists
  • wc -l data/devign_train.jsonl → >1000 samples
  • data/devign_test.jsonl exists
  • wc -l data/devign_test.jsonl → exactly 100 samples
  • Train and test commit_ids are disjoint (no overlap)
  • Spot check 3 samples: code_after is non-empty, is_vulnerable is boolean
  • No sample exceeds 80 lines of code
  • Approximate 50/50 split between vulnerable and safe samples
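The Phase 2 boxes can be checked in one pass with a script like the following. It is a sketch that assumes only the fields named above (`commit_id`, `code_after`, `is_vulnerable`); adjust if the JSONL schema differs.

```python
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def check_split(train, test):
    """Return a list of problems with the train/test split (empty list = pass)."""
    problems = []
    if len(train) <= 1000:
        problems.append(f"train has {len(train)} samples, need >1000")
    if len(test) != 100:
        problems.append(f"test has {len(test)} samples, need exactly 100")
    # train/test commit_ids must be disjoint
    overlap = {s["commit_id"] for s in train} & {s["commit_id"] for s in test}
    if overlap:
        problems.append(f"{len(overlap)} commit_ids leak from train into test")
    # spot checks on a few samples from each file
    for s in train[:3] + test[:3]:
        if not s.get("code_after", "").strip():
            problems.append(f"{s['commit_id']}: empty code_after")
        if not isinstance(s.get("is_vulnerable"), bool):
            problems.append(f"{s['commit_id']}: is_vulnerable not boolean")
        if len(s.get("code_after", "").splitlines()) > 80:
            problems.append(f"{s['commit_id']}: over 80 lines")
    # approximate 50/50 class balance
    vuln = sum(s["is_vulnerable"] for s in train)
    if not 0.4 <= vuln / len(train) <= 0.6:
        problems.append(f"class balance off: {vuln}/{len(train)} vulnerable")
    return problems
```

Run it as `check_split(load_jsonl("data/devign_train.jsonl"), load_jsonl("data/devign_test.jsonl"))` and expect an empty list.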

PHASE 3 β€” GPU & Dependencies

3A. Hardware

nvidia-smi
  • GPU visible with ≥16GB VRAM
  • GPU name matches expected (T4 / A10G / L4)
  • Free VRAM ≥ 14GB (kill other processes if needed)

3B. Python environment

python --version
  • Python 3.10 or 3.11 (NOT 3.12 — Unsloth compatibility issues)

3C. Critical libraries

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "from unsloth import FastLanguageModel; print('OK')"
python -c "from trl import GRPOTrainer; print('OK')"
python -c "from peft import PeftModel; print('OK')"
python -c "import wandb; print('OK')"
  • torch ≥ 2.3.0, CUDA = True
  • unsloth imports without error
  • trl ≥ 0.12.0 imports without error
  • peft imports without error
  • wandb imports without error

PHASE 4 β€” Model Loading Test

import torch  # needed for the memory readout below
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
print("Model loaded successfully")
print(f"GPU memory: {torch.cuda.memory_allocated()/1e9:.1f}GB")
  • Model loads without OOM
  • GPU memory after load < 6GB (leaves room for GRPO overhead)
  • No warnings about missing tokenizer files

LoRA application

model = FastLanguageModel.get_peft_model(
    model, r=8, lora_alpha=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj"],
)
print(f"Trainable params: {sum(p.numel() for p in model.parameters() if p.requires_grad)}")
  • LoRA applies without error
  • Trainable params ~3-8M (NOT the full 3B)

PHASE 5 β€” Dry Run (2 Steps)

THE MOST CRITICAL CHECK. DO NOT SKIP.

python train_grpo.py --max_steps 2

5A. Generation

  • First prompt formatted correctly (print it — does it contain a code diff?)
  • 4 completions generated for first prompt
  • At least 2 of 4 completions contain <action_type> XML tags
  • Completions are different from each other (not all identical)

5B. Reward collection

  • All 4 completions submitted to env
  • All 4 rewards received (no timeouts)
  • Rewards have variance (NOT all the same value)
  • Rewards in expected range [-1.0, +2.0]
  • Print rewards: [_____, _____, _____, _____] (write them down)
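A one-shot gate for the 5B boxes. This sketch hard-codes the checklist's numbers (4 completions, range [-1.0, +2.0]); the variance check matters because identical rewards give GRPO's group-relative advantage nothing to work with.

```python
from statistics import pstdev

def rewards_ok(rewards, lo=-1.0, hi=2.0):
    """Dry-run gate: 4 rewards, all in range, with non-zero variance."""
    if len(rewards) != 4:
        return False  # one reward per completion expected
    if any(not (lo <= r <= hi) for r in rewards):
        return False  # reward outside the expected range
    # identical rewards give GRPO no gradient signal
    return pstdev(rewards) > 0
```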

5C. Training step

  • GRPO loss computed (finite number, not NaN, not inf, not 0.0)
  • Loss value: _____ (write it down)
  • Wandb shows run with 2 logged steps
  • No OOM during backward pass
  • Peak GPU memory: _____GB (must be < 22GB on A10G or < 14GB on T4)

5D. Checkpointing

  • Output directory created: ./commitguard-llama-3b-grpo/
  • Checkpoint files present (or will be at step 50)

5E. Timing estimate

  • 2 steps took _____ seconds
  • Estimated time for 300 steps: _____ minutes (= 2-step-time × 150)
  • Estimated cost: _____ dollars (hours Γ— GPU hourly rate)
  • Cost within budget? (must be under $8)
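The 5E arithmetic as a function, so the extrapolation is done the same way every time. The default `hourly_rate` is an assumed rough A10G on-demand price, not a quoted one; substitute your provider's actual rate.

```python
def estimate_run(two_step_seconds: float, total_steps: int = 300,
                 hourly_rate: float = 3.15) -> dict:
    """Extrapolate dry-run timing to the full run and check the $8 budget.

    hourly_rate is an assumed A10G price; replace with your real rate.
    """
    minutes = two_step_seconds * (total_steps / 2) / 60  # 2-step-time x 150
    hours = minutes / 60
    cost = hours * hourly_rate
    return {
        "minutes": round(minutes, 1),
        "cost_usd": round(cost, 2),
        "within_budget": cost < 8.0,
    }
```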

PHASE 6 — Baseline Eval (Before Training)

MUST run baseline BEFORE training. Cannot run after — you need the contrast.

python evaluate.py \
  --model_path meta-llama/Llama-3.2-3B-Instruct \
  --test_file data/devign_test.jsonl \
  --output eval_baseline.json
  • Eval completes on all 100 test samples
  • Binary accuracy: _____% (write it down, expected: 30-50%)
  • CWE accuracy: _____% (expected: low, maybe 5-15%)
  • False positive rate: _____%
  • False negative rate: _____%
  • Results saved to eval_baseline.json
  • File committed to repo
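For filling in the blanks above, the binary metrics reduce to four counts. A sketch independent of `evaluate.py` (whose internals aren't shown here), taking (ground truth, prediction) boolean pairs:

```python
def binary_metrics(samples):
    """Compute accuracy, FPR, and FNR from (truth, prediction) boolean pairs."""
    tp = sum(1 for t, p in samples if t and p)
    fp = sum(1 for t, p in samples if not t and p)
    fn = sum(1 for t, p in samples if t and not p)
    tn = sum(1 for t, p in samples if not t and not p)
    return {
        "accuracy_pct": 100 * (tp + tn) / len(samples),
        # FPR: share of safe samples wrongly flagged
        "fpr_pct": 100 * fp / max(fp + tn, 1),
        # FNR: share of vulnerable samples missed
        "fnr_pct": 100 * fn / max(fn + tp, 1),
    }
```

The same function fills in the Phase 10 blanks, so baseline and trained numbers are computed identically.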

PHASE 7 — Launch Real Training

Pre-launch final checks

  • All phases 0-6 are GREEN
  • Budget approved by Niti (team lead)
  • Config confirmed:
    • max_steps = 300
    • save_steps = 50
    • logging_steps = 1
    • num_generations = 4
    • learning_rate = 5e-6
    • report_to = "wandb"
  • HF Space is still healthy (re-check /health)
  • Screenshot this checklist with all boxes ticked → post in team channel

Launch

# Option A: HF Jobs (preferred)
hf jobs uv run --flavor a10g-large train_grpo.py

# Option B: GCP (fallback)
nohup python train_grpo.py > training.log 2>&1 &
  • Job started successfully
  • Job ID / Dashboard URL captured: _______________________
  • Wandb run URL captured: _______________________
  • Posted both URLs in team channel
  • Set alarm to check in 30 minutes

PHASE 8 — During Training Monitoring

Check every 30 minutes while awake. Check immediately on waking up.

Quick health check (< 2 min each time)

| Time  | reward/mean | reward/std | loss  | GPU mem | Status   |
|-------|-------------|------------|-------|---------|----------|
| +30m  | _____       | _____      | _____ | _____   | ✅/⚠️/❌ |
| +1h   | _____       | _____      | _____ | _____   | ✅/⚠️/❌ |
| +1.5h | _____       | _____      | _____ | _____   | ✅/⚠️/❌ |
| +2h   | _____       | _____      | _____ | _____   | ✅/⚠️/❌ |
| Final | _____       | _____      | _____ | _____   | ✅/⚠️/❌ |

Red flags → immediate action

| Red flag | Action |
|----------|--------|
| reward/mean trending DOWN | Check env /health. If healthy, lower LR to 2e-6 and relaunch from latest checkpoint. |
| loss = NaN | Kill run. Add max_grad_norm=1.0 to config. Relaunch from checkpoint. |
| GPU memory > 23GB | Will OOM soon. Kill run. Reduce num_generations to 2. Relaunch. |
| Env returning errors in Wandb logs | HF Space is sleeping. Hit /health to wake. If down, Niti restarts. |
| Steps/second dropped to 0 | Job hung. Kill and relaunch from checkpoint. |
| All rewards identical for 50+ steps | Reward function bug. Ping Deepak. |
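For the 3 AM check, the triage rules above can be collapsed into one function. A sketch with assumed argument names; it returns the first matching flag in severity order, or None when all metrics look healthy.

```python
import math

def triage(reward_mean_trend, loss, gpu_mem_gb, steps_per_sec, reward_std):
    """Return the first matching red-flag action from the table, or None."""
    if math.isnan(loss):
        return "loss NaN: kill run, add max_grad_norm=1.0, relaunch from checkpoint"
    if gpu_mem_gb > 23:
        return "near OOM: kill run, set num_generations=2, relaunch"
    if steps_per_sec == 0:
        return "job hung: kill and relaunch from checkpoint"
    if reward_mean_trend < 0:
        return "reward trending down: check /health, then lower LR to 2e-6"
    if reward_std == 0:
        return "zero reward variance: suspect reward function bug, ping Deepak"
    return None
```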

PHASE 9 — Post-Training

Immediately after training completes

  • Training finished without crash
  • Wandb run status: "finished"
  • Final reward/mean: _____ (higher than step-1 reward? That's the curve.)
  • Screenshot reward curve from Wandb → save as plots/reward_curve.png
  • Final checkpoint exists in output directory
  • Total training time: _____ hours
  • Total cost: $_____

Save the model

# Push LoRA adapter to HF Hub
huggingface-cli upload inmodel-labs/commitguard-llama-3b \
  ./commitguard-llama-3b-grpo/final

Verify the saved model loads

from peft import PeftModel
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = PeftModel.from_pretrained(base, "inmodel-labs/commitguard-llama-3b")
print("Trained model loads correctly")
  • Model loads without error
  • Quick inference produces XML-tagged output (not garbage)
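A quick, model-free way to check the "XML-tagged output" box: run one inference and pass the decoded text through a tag check. The exact tag body is an assumption; the checklist only guarantees an `<action_type>` tag pair appears in well-formed completions.

```python
import re

def has_action_tags(text: str) -> bool:
    """Smoke test: completion contains an <action_type>...</action_type> pair."""
    # re.S lets the tag body span newlines in long completions
    return bool(re.search(r"<action_type>.*?</action_type>", text, re.S))
```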

PHASE 10 — Trained Model Eval

python evaluate.py \
  --model_path ./commitguard-llama-3b-grpo/final \
  --test_file data/devign_test.jsonl \
  --is_lora \
  --base_model meta-llama/Llama-3.2-3B-Instruct \
  --output eval_trained.json
  • Eval completes on all 100 test samples
  • Binary accuracy: _____% (compare to baseline: _____%)
  • CWE accuracy: _____% (compare to baseline: _____%)
  • False positive rate: _____% (compare to baseline: _____%)
  • False negative rate: _____% (compare to baseline: _____%)
  • Results saved to eval_trained.json
  • File committed to repo

The verdict

  • Trained accuracy > baseline accuracy? YES / NO
  • If YES: by how many percentage points? _____pp
  • If NO: check if qualitative outputs improved (reasoning traces better even if accuracy similar)

Hand off to team

  • Post in team channel:
    TRAINING COMPLETE
    Baseline accuracy: X%
    Trained accuracy: Y%
    Improvement: +Zpp
    Wandb: [url]
    Reward curve: [screenshot]
    Model on Hub: inmodel-labs/commitguard-llama-3b
    Ready for plots and README.
    
  • Hand eval_baseline.json and eval_trained.json to Deepak for plot generation
  • Kill GCP VM if running (gcloud compute instances stop ...)
  • Update budget tracker in team channel

PHASE 11 — Inference for Demo Video

Divyank runs this to get the before/after examples for the demo recording.

Pick the demo sample

  • Find ONE sample from test set where:
    • Ground truth: vulnerable (preferably CWE-89 SQL injection)
    • Baseline model gets it WRONG
    • Trained model gets it RIGHT
  • Sample commit_id: _______________________
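The demo-sample filter can be automated rather than hand-searched. A sketch with assumed shapes: each prediction dict maps `commit_id` to the model's predicted `is_vulnerable`, and `vuln_type` is assumed to be the field carrying the CWE label.

```python
def pick_demo_sample(samples, baseline_preds, trained_preds):
    """Pick a vulnerable sample the baseline misses but the trained model catches."""
    candidates = [
        s for s in samples
        if s["is_vulnerable"]
        and baseline_preds.get(s["commit_id"]) is False  # baseline WRONG
        and trained_preds.get(s["commit_id"]) is True    # trained RIGHT
    ]
    # prefer SQL injection samples, as the checklist asks (vuln_type is an
    # assumed field name)
    cwe89 = [s for s in candidates if s.get("vuln_type") == "CWE-89"]
    return (cwe89 or candidates or [None])[0]
```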

Generate baseline output

# Load untrained model, generate response for the demo sample
# Save full text output to demo_baseline_output.txt
  • Baseline output saved
  • Output shows: wrong verdict / no reasoning / random guess

Generate trained output

# Load trained model, generate response for the demo sample
# Save full text output to demo_trained_output.txt
  • Trained output saved
  • Output shows: correct verdict / identifies CWE / sketches exploit
  • The contrast between baseline and trained is VISIBLE and OBVIOUS

Ready for recording

  • Both outputs saved as text files for screen capture
  • The diff for this sample is readable (not 80 lines of dense C)
  • Proceed to demo video recording (see tasks_divyank.md)

Emergency Fallback Reference Card

Tape this next to your screen. Read it at 3 AM when your brain is mush.

CRASHED? → Check Wandb → Is it OOM?
  YES OOM → num_generations=2, retry from checkpoint
  STILL OOM → Switch to Qwen2.5-1.5B, retry from scratch
  NOT OOM → Check error message → Screenshot → Post in team channel

REWARDS ALL ZERO? → Env bug, not model bug
  → curl /health on HF Space
  → If dead: ping Niti
  → If alive: curl /step manually, check reward value
  → If reward from curl is also 0: Deepak's reward function bug

LLAMA ACCESS DENIED? → Switch to Qwen2.5-1.5B immediately
  → Change ONE line: model_name="Qwen/Qwen2.5-1.5B-Instruct"
  → Everything else stays the same

CURVE IS FLAT? → Ship it anyway with honest narrative
  → "Training evidence shows optimization attempted;
     reward signal needs richer shaping in future work"
  → A flat curve + honest story > no submission