# Tasks: Divyank (Evaluation + Storytelling)

**Project:** CommitGuard (OpenEnv Hackathon Submission)
**Submission deadline:** Sunday 5:00 PM IST
**Your role:** Own the evaluation pipeline and the storytelling assets (demo video, HF blog post). You are the "Quality & Communications" lead.

---

## Phase 1 & 2: Foundation & Integration (Saturday Night)

### Task 1.1: Evaluation Script Hardening (2 hours)
**Goal:** Take the baseline `scripts/evaluate.py` and make it a robust testing tool.
- [x] Update `scripts/evaluate.py` to support multi-step episodes (up to 5 steps).
- [x] Implement logic to handle the agent's `<action>` XML outputs sequentially.
- [x] Add support for loading a PEFT/LoRA adapter (which Niti will provide after training).
- [x] Ensure it generates `eval_results.json` with accuracy broken down by CWE type.
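
The parsing and reporting steps above can be sketched as follows. This is a minimal illustration, not the actual contents of `scripts/evaluate.py`: it assumes the agent wraps each step in a plain `<action>...</action>` pair, and the helper names (`parse_actions`, `accuracy_by_cwe`, `write_results`) and the record keys `cwe_type` and `correct` are hypothetical.

```python
import json
import re
from collections import defaultdict

# Assumption: each agent step is emitted as <action>...</action>.
# The real action schema may carry attributes or nested tags.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)
MAX_STEPS = 5  # episode cap taken from the task list


def parse_actions(model_output, max_steps=MAX_STEPS):
    """Return the body of each <action> block in order, capped at max_steps."""
    return [body.strip() for body in ACTION_RE.findall(model_output)][:max_steps]


def accuracy_by_cwe(records):
    """Summarise results; each record needs 'cwe_type' and a boolean 'correct'."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["cwe_type"]] += 1
        hits[rec["cwe_type"]] += int(rec["correct"])
    num_samples = sum(totals.values())
    overall = sum(hits.values()) / num_samples if num_samples else 0.0
    return {
        "overall_accuracy": overall,
        "num_samples": num_samples,
        "accuracy_by_cwe": {cwe: hits[cwe] / totals[cwe] for cwe in totals},
    }


def write_results(records, path="eval_results.json"):
    """Write the summary as JSON and return it for further use."""
    summary = accuracy_by_cwe(records)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(summary, fh, indent=2, sort_keys=True)
    return summary
```

Capping the parsed actions at `MAX_STEPS` keeps a runaway generation from extending an episode past the five-step limit.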

### Task 1.2: Dataset Spot-Check (30 min)
**Goal:** Verify the quality of Deepak's 5,000-sample dataset in the `mvd` branch.
- [x] Manually review 20 random samples from `data/devign_filtered.jsonl`.
- [x] Confirm that each sample's `diff` and `cwe_type` agree with each other and that the sample is reasonable for a 3B model.
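
A reproducible way to draw the 20 samples for manual review might look like the sketch below. It assumes the file is standard JSON Lines with one object per line and that each object carries `diff` and `cwe_type` keys; the function names and the fixed seed are illustrative choices, not part of the existing tooling.

```python
import json
import random


def sample_records(path, k=20, seed=0):
    """Load a JSONL file and return k records drawn with a fixed seed."""
    with open(path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    rng = random.Random(seed)  # fixed seed so a teammate can re-draw the same set
    return rng.sample(records, min(k, len(records)))


def missing_fields(record, required=("diff", "cwe_type")):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in required if not record.get(name)]


if __name__ == "__main__":
    for i, rec in enumerate(sample_records("data/devign_filtered.jsonl"), start=1):
        problems = missing_fields(rec)
        status = "OK" if not problems else "missing " + ", ".join(problems)
        print(f"[{i:02d}] {rec.get('cwe_type', '?')}: {status}")
```

The automated field check only catches empty or missing values; judging whether a `diff` actually matches its `cwe_type` still needs a human read.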

---

## Phase 3: Demo + Storytelling (Sunday Morning)

### Task 3.1: Baseline vs. Trained Evaluation (P0)
**Goal:** Produce the data that Niti needs for the final plots.
- [x] Run the hardened `evaluate.py` against 100 held-out samples using the **Untrained** model.
- [x] Run it again once Niti provides the **Trained** LoRA adapter.
- [x] Capture the delta: "Detection accuracy: X% -> Y%".
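
Producing the headline string from the two runs can be sketched as below. This assumes each run wrote a JSON file containing an `overall_accuracy` value between 0 and 1; that key and both function names are hypothetical, so adjust them to whatever `evaluate.py` actually emits.

```python
import json


def load_accuracy(path, key="overall_accuracy"):
    """Read one accuracy value (0 to 1) from a results file; the key is an assumption."""
    with open(path, encoding="utf-8") as fh:
        return float(json.load(fh)[key])


def format_delta(baseline_acc, trained_acc):
    """Format the before/after line plus the change in percentage points."""
    delta_pp = (trained_acc - baseline_acc) * 100
    return (
        f"Detection accuracy: {baseline_acc:.1%} -> {trained_acc:.1%} "
        f"({delta_pp:+.1f} pp)"
    )
```

Reporting the change in percentage points rather than as a relative percentage avoids ambiguity when the baseline accuracy is low.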

### Task 3.2: Demo Video Recording (P0)
**Goal:** Create the visceral "Emotional Hook" for the judges.
- [ ] Pick one "Hero Case" (e.g., a clear SQL Injection).
- [ ] Record a 90-second side-by-side: the Untrained model fumbling vs. the Trained model reasoning about and identifying the vulnerability.
- [ ] Upload to YouTube as Unlisted and provide the link for the README.

### Task 3.3: HF Hub Blog Post (P1)
**Goal:** Hit the rubric requirement for community outreach.
- [ ] Write a post on the HF Hub explaining the project, the RLVR reward design, and the results.
- [ ] Embed the demo video and the reward plots generated by Niti.

---

## Sync Points
- [x] **Midnight Saturday:** Confirm `evaluate.py` can handle multi-step XML traces.
- [x] **9:00 AM Sunday:** Report final accuracy numbers (Baseline vs. Trained).
- [ ] **3:00 PM Sunday:** Final link check for Video and Blog post.