# Tasks: Divyank (Evaluation + Storytelling)

**Project:** CommitGuard OpenEnv Hackathon Submission

**Submission deadline:** Sunday 5:00 PM IST

**Your role:** Own the evaluation pipeline and the storytelling assets (demo video, HF blog post). You are the "Quality & Communications" lead.

---

## Phase 1 & 2: Foundation & Integration (Saturday Night)

### Task 1.1: Evaluation Script Hardening (2 hours)

**Goal:** Take the baseline `scripts/evaluate.py` and make it a robust testing tool.

- [x] Update `scripts/evaluate.py` to support multi-step episodes (up to 5 steps).
- [x] Implement logic to handle the agent's XML outputs sequentially.
- [x] Add support for loading a PEFT/LoRA adapter (which Niti will provide after training).
- [x] Ensure it generates `eval_results.json` with a breakdown of Accuracy vs. CWE type.

### Task 1.2: Dataset Spot-Check (30 min)

**Goal:** Verify the quality of Deepak's 5000-sample dataset in the `mvd` branch.

- [x] Manually review 20 random samples from `data/devign_filtered.jsonl`.
- [x] Ensure the `diff` and `cwe_type` are consistent and reasonable for a 3B model.

---

## Phase 3: Demo + Storytelling (Sunday Morning)

### Task 3.1: Baseline vs. Trained Evaluation (P0)

**Goal:** Produce the data that Niti needs for the final plots.

- [x] Run the hardened `evaluate.py` against 100 held-out samples using the **Untrained** model.
- [x] Run it again once Niti provides the **Trained** LoRA adapter.
- [x] Capture the delta: "Detection accuracy: X% -> Y%".

### Task 3.2: Demo Video Recording (P0)

**Goal:** Create the visceral "Emotional Hook" for the judges.

- [ ] Pick one "Hero Case" (e.g., a clear SQL Injection).
- [ ] Record a 90-second side-by-side: Untrained fumbling vs. Trained reasoning and identifying the vulnerability.
- [ ] Upload to YouTube as Unlisted and provide link for the README.

### Task 3.3: HF Hub Blog Post (P1)

**Goal:** Hit the rubric requirement for community outreach.

- [ ] Write a post on the HF Hub explaining the project, the RLVR reward design, and the results.
- [ ] Embed the demo video and the reward plots generated by Niti.

---

## Sync Points

- [x] **Midnight Saturday:** Confirm `evaluate.py` can handle multi-step XML traces.
- [x] **9:00 AM Sunday:** Report final accuracy numbers (Baseline vs. Trained).
- [ ] **3:00 PM Sunday:** Final link check for Video and Blog post.
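
---

For reference, the per-CWE accuracy breakdown from Task 1.1 and the "X% -> Y%" delta from Task 3.1 could be computed roughly as sketched below. This is a minimal illustration, not the actual contents of `scripts/evaluate.py`: the record keys (`cwe_type`, `label`, `prediction`), the output field names, and the helper names `summarize` and `format_delta` are all assumptions made for the example.

```python
from collections import defaultdict


def summarize(records):
    """Aggregate overall and per-CWE accuracy.

    Assumes each record is a dict with hypothetical keys
    'cwe_type', 'label', and 'prediction'.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        cwe = rec["cwe_type"]
        totals[cwe] += 1
        if rec["prediction"] == rec["label"]:
            correct[cwe] += 1

    num_samples = sum(totals.values())
    overall = sum(correct.values()) / num_samples if num_samples else 0.0
    return {
        "num_samples": num_samples,
        "overall_accuracy": overall,
        # Sorted so the JSON output is stable between runs.
        "accuracy_by_cwe": {c: correct[c] / totals[c] for c in sorted(totals)},
    }


def format_delta(baseline, trained):
    """Render the headline delta in the 'X% -> Y%' form used in Task 3.1."""
    return (
        f"Detection accuracy: {baseline['overall_accuracy']:.1%}"
        f" -> {trained['overall_accuracy']:.1%}"
    )
```

The dict returned by `summarize` can be written to `eval_results.json` with `json.dump`. Running it once on the untrained model's predictions and once on the trained adapter's predictions gives the two inputs for `format_delta`.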