Tasks Divyank (Evaluation + Storytelling)
Project: CommitGuard OpenEnv Hackathon Submission
Submission deadline: Sunday 5:00 PM IST
Your role: Own the evaluation pipeline and the storytelling assets (demo video, HF blog post). You are the "Quality & Communications" lead.
Phase 1 & 2 Foundation & Integration (Saturday Night)
Task 1.1 Evaluation Script Hardening (2 hours)
Goal: Take the baseline scripts/evaluate.py and make it a robust testing tool.
- Update scripts/evaluate.py to support multi-step episodes (up to 5 steps).
- Implement logic to handle the agent's <action> XML outputs sequentially.
- Add support for loading a PEFT/LoRA adapter (which Niti will provide after training).
- Ensure it generates eval_results.json with a breakdown of Accuracy vs. CWE type.
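The parsing and reporting steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: that each step is wrapped as <action>...</action>, that each evaluated sample yields a record with cwe_type and correct fields, and that the helper names (parse_actions, accuracy_by_cwe, write_results) are hypothetical. The real schema in scripts/evaluate.py may differ.

```python
import json
import re
from collections import defaultdict

MAX_STEPS = 5  # episode cap stated in the task list

# Assumed format: each step is wrapped as <action>...</action>.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def parse_actions(model_output, max_steps=MAX_STEPS):
    """Return the agent's actions in order, truncated to max_steps."""
    actions = [m.strip() for m in ACTION_RE.findall(model_output)]
    return actions[:max_steps]


def accuracy_by_cwe(records):
    """Group records by CWE type and compute accuracy per group.

    Each record is assumed to look like {"cwe_type": str, "correct": bool}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["cwe_type"]] += 1
        hits[rec["cwe_type"]] += int(rec["correct"])
    return {
        cwe: {"n": totals[cwe], "accuracy": hits[cwe] / totals[cwe]}
        for cwe in sorted(totals)
    }


def write_results(records, path="eval_results.json"):
    """Write overall and per-CWE accuracy to a JSON file (layout is assumed)."""
    overall = (
        sum(int(r["correct"]) for r in records) / len(records) if records else 0.0
    )
    payload = {"overall_accuracy": overall, "by_cwe": accuracy_by_cwe(records)}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)
    return payload
```

Keeping the parser separate from the scoring makes the Midnight Saturday sync point easy to check: parse_actions can be run against a saved multi-step trace without loading any model.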
Task 1.2 Dataset Spot-Check (30 min)
Goal: Verify the quality of Deepak's 5000-sample dataset in the mvd branch.
- Manually review 20 random samples from data/devign_filtered.jsonl.
- Ensure the diff and cwe_type are consistent and reasonable for a 3B model.
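One way to draw the 20 samples reproducibly is sketched below. It assumes the file is standard JSONL (one JSON object per line) and that every record should carry non-empty diff and cwe_type fields. The helper names and the fixed seed are illustrative choices, not part of the existing repo.

```python
import json
import random


def sample_records(lines, k=20, seed=0):
    """Parse JSONL lines and return k randomly chosen records (reproducible)."""
    records = [json.loads(line) for line in lines if line.strip()]
    rng = random.Random(seed)  # fixed seed so the same samples can be re-reviewed
    return rng.sample(records, min(k, len(records)))


def missing_fields(record, required=("diff", "cwe_type")):
    """List required fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]
```

To use it, open data/devign_filtered.jsonl, pass the file object to sample_records, and print each record's diff and cwe_type for manual review. missing_fields only catches structural gaps; judging whether the label matches the diff still needs a human read.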
Phase 3 Demo + Storytelling (Sunday Morning)
Task 3.1 Baseline vs. Trained Evaluation (P0)
Goal: Produce the data that Niti needs for the final plots.
- Run the hardened evaluate.py against 100 held-out samples using the Untrained model.
- Run it again once Niti provides the Trained LoRA adapter.
- Capture the delta: "Detection accuracy: X% -> Y%".
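The headline delta can be produced with a small helper like the sketch below. It assumes each run yields a list of per-sample booleans (True when the detection was correct). The function names and the "pts" suffix are illustrative, not an agreed reporting format.

```python
def accuracy(correct_flags):
    """Fraction of samples marked correct; 0.0 for an empty run."""
    flags = list(correct_flags)
    return sum(bool(f) for f in flags) / len(flags) if flags else 0.0


def format_delta(baseline_acc, trained_acc):
    """Format two accuracies (fractions in [0, 1]) as the headline string."""
    delta_pts = (trained_acc - baseline_acc) * 100
    return (
        f"Detection accuracy: {baseline_acc * 100:.1f}% -> "
        f"{trained_acc * 100:.1f}% ({delta_pts:+.1f} pts)"
    )
```

Both runs should use the same 100 held-out samples so that the difference reflects the adapter rather than the sample draw.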
Task 3.2 Demo Video Recording (P0)
Goal: Create the visceral "Emotional Hook" for the judges.
- Pick one "Hero Case" (e.g., a clear SQL Injection).
- Record a 90-second side-by-side: the Untrained model fumbling vs. the Trained model reasoning through the diff and identifying the vulnerability.
- Upload to YouTube as Unlisted and provide the link for the README.
Task 3.3 HF Hub Blog Post (P1)
Goal: Hit the rubric requirement for community outreach.
- Write a post on the HF Hub explaining the project, the RLVR reward design, and the results.
- Embed the demo video and the reward plots generated by Niti.
Sync Points
- Midnight Saturday: Confirm evaluate.py can handle multi-step XML traces.
- 9:00 AM Sunday: Report final accuracy numbers (Baseline vs. Trained).
- 3:00 PM Sunday: Final link check for Video and Blog post.