# Tasks – Deepak (Data + Reward + Evaluation)

**Project:** CommitGuard – OpenEnv Hackathon Submission
**Submission deadline:** Sunday 5:00 PM IST
**Your role:** Own the data pipeline, the reward function, and the evaluation that produces the plots judges will see.

---

## Why you own these

The reward function is the soul of the env: it determines whether the agent learns anything at all. The data pipeline determines whether the env can scale. The evaluation produces the plots that drive 20% of the rubric score directly. Three things, all surgical, all yours.

You can start immediately: your work doesn't depend on Niti's env code being ready.

---

## Phase 1 – Foundation (9:30 PM Saturday – 12:30 AM Sunday)

### Task 1.1 – Devign data preprocessing (1.5 hours)

**Goal:** A single JSONL file with clean, balanced, filtered samples ready for the env.

- [x] Verify Devign dataset is on disk locally. If not, download from HuggingFace: `DetectBERT/devign` or the original Devign release
- [x] Write `preprocess_devign.py` (sketched after this checklist):
  - Load all samples
  - Filter: drop samples where `len(code.split('\n')) > 80` (keeps context windows manageable for Qwen-0.5B / Llama-3.2-3B)
  - Filter: drop samples without a clear CWE label
  - Balance: roughly 50/50 split between vulnerable and safe
  - Output schema per row:
    ```json
    {
      "commit_id": "synthetic_0001",
      "code_before": "...",
      "code_after": "...",
      "is_vulnerable": true,
      "cwe_type": "CWE-89",
      "target_file": "auth.c",
      "available_files": ["auth.c", "db.c", "utils.c"]
    }
    ```

  - Note: Devign doesn't have real "before/after" diffs; synthesize by treating each function as `code_after` and using a slightly mutated version (or just an empty string + `code_after`) as `code_before`. Don't overthink this; the diff representation is what matters.

- [x] Save to `data/devign_filtered.jsonl`

- [x] Aim for ~5000 samples post-filter. If you have fewer, that's fine; quality over quantity.

- [x] Smoke test: `wc -l data/devign_filtered.jsonl` and spot-check 5 random samples manually for sanity
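
A minimal sketch of `preprocess_devign.py`, assuming the HuggingFace `datasets` library and that the `DetectBERT/devign` dataset exposes `func` (function source) and `target` (0/1 label) columns; the split name and the CWE column (`cwe` below) are placeholders to adjust against the actual release:

```python
import json
import random

from datasets import load_dataset

MAX_LINES = 80  # keeps context windows manageable for Qwen-0.5B / Llama-3.2-3B

def main(out_path: str = "data/devign_filtered.jsonl") -> None:
    ds = load_dataset("DetectBERT/devign", split="train")  # split name is an assumption
    vulnerable, safe = [], []
    for i, row in enumerate(ds):
        code = row["func"]
        if len(code.split("\n")) > MAX_LINES:
            continue  # drop long functions
        sample = {
            "commit_id": f"synthetic_{i:04d}",
            "code_before": "",  # no real before/after in Devign; see the note above
            "code_after": code,
            "is_vulnerable": bool(row["target"]),
            "cwe_type": row.get("cwe", "CWE-UNKNOWN"),  # placeholder column name
            "target_file": "func.c",
            "available_files": ["func.c"],
        }
        (vulnerable if sample["is_vulnerable"] else safe).append(sample)
    # Balance to a roughly 50/50 vulnerable/safe split
    n = min(len(vulnerable), len(safe))
    rows = random.sample(vulnerable, n) + random.sample(safe, n)
    random.shuffle(rows)
    with open(out_path, "w") as f:
        f.writelines(json.dumps(r) + "\n" for r in rows)

if __name__ == "__main__":
    main()
```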


### Task 1.2 – CWE keyword dictionary (30 min)

**Goal:** Map each top-10 CWE to a list of exploit-pattern keywords for reward computation.

- [x] Identify the top 10 CWEs by frequency in your filtered dataset (a counting sketch follows this list)
- [x] For each CWE, list 5-10 keywords/phrases that would appear in a plausible exploit description
- [x] Save to `cwe_keywords.json`:
  ```json
  {
    "CWE-89": ["sql injection", "drop table", "union select", "or 1=1", "concat", "unsanitized input"],
    "CWE-79": ["xss", "script tag", "innerhtml", "eval", "javascript:", "onerror"],
    "CWE-78": ["command injection", "os.system", "subprocess", "shell=true", "exec", "popen"],
    ...
  }
  ```
- [x] Source from MITRE CWE pages (cwe.mitre.org/data/definitions/89.html etc.): copy the exploit examples, extract phrases
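
Counting CWE frequencies to pick the top 10 is a few lines with `collections.Counter`; a sketch against the Task 1.1 schema:

```python
import json
from collections import Counter

counts = Counter()
with open("data/devign_filtered.jsonl") as f:
    for line in f:
        row = json.loads(line)
        if row["is_vulnerable"]:
            counts[row["cwe_type"]] += 1

# These ten CWEs get entries in cwe_keywords.json
for cwe, n in counts.most_common(10):
    print(f"{cwe}: {n}")
```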

### Task 1.3 – Reward function (1 hour)

**Goal:** Pure function that takes an action + ground truth and returns a scalar reward. Tested.

- [x] Write `reward.py`:
  ```python
  def compute_reward(action: dict, ground_truth: dict, cwe_keywords: dict, step_count: int) -> float:
      reward = 0.0

      # Per-step efficiency penalty
      if action["action_type"] == "request_context":
          return -0.05

      # Analyze action: no reward, just logged
      if action["action_type"] == "analyze":
          return 0.0

      # Verdict action: main reward signal
      if action["action_type"] == "verdict":
          # Correctness of binary classification
          if action["is_vulnerable"] == ground_truth["is_vulnerable"]:
              reward += 1.0
              # Bonus: correct CWE classification
              if ground_truth["is_vulnerable"] and action["vuln_type"] == ground_truth["cwe_type"]:
                  reward += 0.5
              # Bonus: plausible exploit sketch
              if ground_truth["is_vulnerable"] and action["exploit_sketch"]:
                  patterns = cwe_keywords.get(ground_truth["cwe_type"], [])
                  sketch_lower = action["exploit_sketch"].lower()
                  if any(p in sketch_lower for p in patterns):
                      reward += 0.5
          else:
              # Wrong classification
              if action["is_vulnerable"] and not ground_truth["is_vulnerable"]:
                  reward -= 1.0  # False positive
              else:
                  reward -= 0.5  # False negative

      return reward
  ```
- [x] Write 5 hand-crafted unit tests in `test_reward.py` (a sketch follows this list):
  - Correct vulnerable verdict → reward = 1.0
  - Correct vulnerable + correct CWE + good sketch → reward = 2.0
  - False positive (flagged safe as vulnerable) → reward = -1.0
  - False negative (missed real vuln) → reward = -0.5
  - Context request → reward = -0.05
- [x] All tests pass
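
A sketch of those five tests; assumes `compute_reward` from `reward.py` above (the `step_count` argument is currently unused by the reward, so any value works):

```python
from reward import compute_reward

CWE_KEYWORDS = {"CWE-89": ["sql injection", "drop table"]}
GT_VULN = {"is_vulnerable": True, "cwe_type": "CWE-89"}
GT_SAFE = {"is_vulnerable": False, "cwe_type": None}

def verdict(is_vuln, vuln_type=None, sketch=""):
    return {"action_type": "verdict", "is_vulnerable": is_vuln,
            "vuln_type": vuln_type, "exploit_sketch": sketch}

def test_correct_vulnerable_verdict():
    # Right binary call, wrong CWE, no sketch: base reward only
    assert compute_reward(verdict(True, "CWE-00"), GT_VULN, CWE_KEYWORDS, 1) == 1.0

def test_full_credit():
    action = verdict(True, "CWE-89", sketch="classic SQL injection via DROP TABLE")
    assert compute_reward(action, GT_VULN, CWE_KEYWORDS, 1) == 2.0

def test_false_positive():
    assert compute_reward(verdict(True, "CWE-89"), GT_SAFE, CWE_KEYWORDS, 1) == -1.0

def test_false_negative():
    assert compute_reward(verdict(False), GT_VULN, CWE_KEYWORDS, 1) == -0.5

def test_context_request_penalty():
    action = {"action_type": "request_context", "file": "db.c"}
    assert compute_reward(action, GT_VULN, CWE_KEYWORDS, 1) == -0.05
```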

### Task 1.4 – No-leak unit test (30 min)

**Goal:** A test that fails loudly if Niti accidentally leaks ground truth into the observation.

- [x] Write `test_no_leak.py`:
  ```python
  from dataclasses import asdict

  # Env and action classes come from Niti's environment.py
  from environment import CommitGuardEnvironment, CommitGuardAction

  def test_observation_does_not_leak_ground_truth():
      env = CommitGuardEnvironment()
      obs = env.reset()
      obs_dict = asdict(obs)
      forbidden_keys = ["is_vulnerable", "cwe_type", "ground_truth", "label"]
      for key in forbidden_keys:
          assert key not in str(obs_dict).lower(), f"Leak detected: {key}"
      # Also check after a step
      obs = env.step(CommitGuardAction(action_type="analyze", reasoning="test"))
      for key in forbidden_keys:
          assert key not in str(asdict(obs)).lower(), f"Leak detected after step: {key}"
  ```
- [x] Run against Niti's env once it's ready. Test must pass.

**Hard checkpoint at midnight:** JSONL exists, reward function passes 5 unit tests, no-leak test passes against Niti's env.

**If RED at midnight:** ship with binary correct/incorrect reward only. Drop CWE bonus and exploit-sketch bonus. Tier the reward later if time allows.

---

## Phase 2 – Integration & Sleep (12:30 AM – 7:00 AM Sunday)

### Task 2.1 – Wire data + reward into Niti's env (12:30 AM – 3:00 AM, 2.5 hours)

- [x] Sync with Niti; pull his latest `environment.py`
- [x] Wire `reset()` to actually load from your JSONL: random sample, return diff + available_files
- [x] Wire `step()` to call your `compute_reward()` with the loaded ground truth (server-side, never returned to client)
- [x] Run 100 random episodes locally with a dummy random-action client (see the driver sketch after this list):
  - No crashes
  - Reward distribution looks reasonable (not all zeros, not all -1.0)
  - Episode lengths bounded by step cap
- [x] Run `test_no_leak.py`  must pass
- [ ] Push env to HF Space: `cd commitguard && openenv push`
- [ ] Verify deployment: `curl https://<your-username>-commitguard.hf.space/health`
- [x] Hand off to Divyank in team channel: "HF Space live at [URL], ready for training integration"
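
A throwaway driver for the 100-episode smoke test. This is a sketch against the same assumed interface as `test_no_leak.py`: the action field names mirror `compute_reward()`'s contract, and the `(obs, reward, done)` return shape of `step()` is an assumption to adjust once Niti's `environment.py` is final:

```python
import random

from environment import CommitGuardEnvironment, CommitGuardAction  # Niti's env module

env = CommitGuardEnvironment()
totals, lengths = [], []
for _ in range(100):
    obs = env.reset()
    total, steps, done = 0.0, 0, False
    while not done:  # relies on the env's step cap to terminate episodes
        kind = random.choice(["request_context", "analyze", "verdict"])
        if kind == "verdict":
            # Constructor arguments are assumptions matching the reward's action dict
            action = CommitGuardAction(action_type="verdict",
                                       is_vulnerable=random.random() < 0.5,
                                       vuln_type="CWE-89", exploit_sketch="")
        else:
            action = CommitGuardAction(action_type=kind, reasoning="random probe")
        obs, reward, done = env.step(action)  # assumed return shape; match the real API
        total += reward
        steps += 1
    totals.append(total)
    lengths.append(steps)

# Sanity checks: reward distribution not degenerate, episodes bounded by the step cap
print(f"mean reward: {sum(totals) / len(totals):.3f}")
print(f"min/max episode length: {min(lengths)}/{max(lengths)}")
```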

### Task 2.2 – Sleep (3:00 AM – 7:00 AM, 4 hours)

- [x] Sleep. Alarm at 7:00 AM. Phone away.
- [x] You wake up, you do evaluation. Need clear head for plotting.

---

## Phase 3 – Evaluation & Plots (7:00 AM – 10:00 AM Sunday)

### Task 3.1 – Held-out test set (7:00 AM – 7:30 AM)

- [x] Carve out 100 samples from the JSONL that were NOT used in training
- [x] Save as `data/devign_test.jsonl`
- [x] These 100 samples will be your evaluation set for both baseline and trained model

### Task 3.2 – Baseline measurement (7:30 AM – 8:30 AM, 1 hour)

- [x] Coordinate with Divyank to get the baseline (untrained) Llama-3.2-3B model loaded
- [x] Run all 100 test samples through baseline:
  - For each sample, prompt the model with the diff and parse its verdict from XML tags (see the parsing sketch after this list)
  - Compute: vulnerability detection accuracy, per-CWE accuracy
- [x] Save raw results to `eval_baseline.json`
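
A minimal parsing sketch, assuming the prompt asks the model to answer inside `<verdict>` and `<vuln_type>` tags; the tag names are placeholders, so match whatever template Divyank's prompt actually uses:

```python
import re

def parse_verdict(model_output: str) -> dict:
    """Pull verdict fields out of a model response, tolerating missing tags."""
    def tag(name):
        m = re.search(rf"<{name}>(.*?)</{name}>", model_output, re.DOTALL | re.IGNORECASE)
        return m.group(1).strip() if m else None

    verdict = (tag("verdict") or "").lower()
    return {
        "is_vulnerable": verdict in ("vulnerable", "yes", "true"),
        "vuln_type": tag("vuln_type"),           # e.g. "CWE-89"
        "parse_failed": tag("verdict") is None,  # track these separately in accuracy
    }
```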

### Task 3.3 – Trained model measurement (8:30 AM – 9:30 AM, 1 hour)

- [x] Once Divyank's training run completes (should be done by ~5:30 AM, results in Wandb)
- [x] Load LoRA-adapted Llama-3.2-3B
- [x] Run same 100 test samples through trained model
- [x] Compute same metrics
- [x] Save raw results to `eval_trained.json`

### Task 3.4 – Generate plots (9:30 AM – 10:00 AM, 30 min)

Use Niti's plot scripts in `plots/` (he writes them in his early shift). You feed them data; they produce PNGs.

- [x] Reward curve plot from Wandb training logs, save as `plots/reward_curve.png`
  - X-axis: training step
  - Y-axis: mean reward
  - Title: "CommitGuard – GRPO Training Reward Curve"
- [x] Baseline vs Trained accuracy bar chart, save as `plots/baseline_vs_trained.png` (a fallback sketch follows this list)
  - Two bars: baseline accuracy, trained accuracy
  - Both numbers labeled on the bars
- [x] Per-CWE breakdown, save as `plots/per_cwe.png`
  - Grouped bar: each CWE has baseline + trained bar
  - Shows which vuln types the model learned fastest
- [x] All plots: axes labeled with units, title, readable from 5 feet away (page 28 reminder)
- [x] Commit all PNGs to repo
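
If Niti's bar-chart script isn't ready when the numbers land, a fallback sketch (the 0.48 / 0.67 values are placeholders; read the real accuracies from `eval_baseline.json` and `eval_trained.json`):

```python
import matplotlib.pyplot as plt

baseline_acc, trained_acc = 0.48, 0.67  # placeholders, not real results

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(["Baseline", "Trained (GRPO)"], [baseline_acc, trained_acc],
              color=["#999999", "#2b8a3e"])
ax.bar_label(bars, fmt="%.2f", fontsize=13)  # numbers labeled on the bars
ax.set_ylabel("Vulnerability detection accuracy (fraction correct)")
ax.set_ylim(0, 1)
ax.set_title("CommitGuard – Baseline vs Trained Accuracy")
fig.tight_layout()
fig.savefig("plots/baseline_vs_trained.png", dpi=200)
```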

---

## Phase 4 – Submission support (10:00 AM – 5:00 PM Sunday)

### Task 4.1 – Numbers handoff to Niti (10:00 AM – 10:30 AM)

- [x] Send Niti the headline numbers in plain text:
  - "Baseline accuracy: X%"
  - "Trained accuracy: Y%"
  - "Best CWE improvement: CWE-XX, +Z%"
  - "Total training steps: N"
- [ ] He drops these into the README

### Task 4.2 – Stretch: ablation (10:30 AM – 1:00 PM, optional, only if Tier 1 done)

- [ ] If time allows, run a second eval comparing trained model on samples it saw during training vs held-out samples
- [ ] This shows generalization, strengthens results section
- [ ] Skip if Niti or Divyank need help instead

### Task 4.3 – Lunch + buffer (1:00 PM – 5:00 PM)

- [ ] Eat
- [ ] Be available for last-minute eval re-runs if something breaks
- [ ] Help with smoke testing the HF Space from a different network

---

## Sync points

- **12:00 AM (midnight)** – Team sync. Report: data ✓/✗, reward ✓/✗, leak test ✓/✗
- **9:00 AM Sunday** – Team sync. Report: baseline numbers, trained numbers, plot status
- **3:00 PM Sunday** – Final sync. Stop adding features.

---

## Fallback rules

- **Devign is too messy / inconsistent:** drop to a smaller, cleaner subset. 1000 high-quality samples beat 5000 noisy ones.
- **CWE keyword matching is too brittle:** drop the exploit-sketch bonus. Reward becomes 1.0 (correct) / 0.5 (correct + CWE) / penalties unchanged. Simpler, still tiered.
- **Training run produces no learning curve:** that's not your problem to fix; Divyank owns it. You produce the evaluation honestly. If the trained model is no better than the baseline, that's the truth; ship it. The pitch can pivot to "we built the env, training is future work": page 26 says "evidence that you trained", not "evidence that training worked perfectly."
- **You can't get the trained model to load:** ask Divyank for raw outputs from the training run and evaluate from those instead.