
Tasks Divyank (Evaluation + Storytelling)

Project: CommitGuard OpenEnv Hackathon Submission
Submission deadline: Sunday 5:00 PM IST
Your role: Own the evaluation pipeline and the storytelling assets (demo video, HF blog post). You are the "Quality & Communications" lead.


Phase 1 & 2 Foundation & Integration (Saturday Night)

Task 1.1 Evaluation Script Hardening (2 hours)

Goal: Take the baseline scripts/evaluate.py and make it a robust testing tool.

  • Update scripts/evaluate.py to support multi-step episodes (up to 5 steps).
  • Implement logic to handle the agent's <action> XML outputs sequentially.
  • Add support for loading a PEFT/LoRA adapter (which Niti will provide after training).
  • Ensure it writes eval_results.json with accuracy broken down by CWE type.
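
The multi-step loop and the per-CWE breakdown in the bullets above could look roughly like the sketch below. It is not the actual scripts/evaluate.py. Several things are assumptions: each agent turn is taken to contain one `<action>...</action>` block whose inner text is treated as opaque; `generate` and `step_env` are hypothetical callables standing in for the model and the environment; and the record key `correct` and the output keys `overall_accuracy` and `accuracy_by_cwe` are illustrative names only.

```python
import json
import re
from collections import defaultdict

MAX_STEPS = 5  # episode cap taken from the task description

# Assumption: one <action>...</action> block per agent turn.
ACTION_RE = re.compile(r"<action>(.*?)</action>", re.DOTALL)


def extract_action(model_output):
    """Return the inner text of the first <action> block, or None if absent."""
    match = ACTION_RE.search(model_output)
    return match.group(1).strip() if match else None


def run_episode(generate, step_env, first_observation):
    """Drive one episode for at most MAX_STEPS turns and return the actions.

    generate(observation) -> raw model text        (hypothetical callable)
    step_env(action) -> (next_observation, done)   (hypothetical callable)
    """
    trace = []
    observation = first_observation
    for _ in range(MAX_STEPS):
        action = extract_action(generate(observation))
        trace.append(action)
        if action is None:  # malformed output: stop rather than guess
            break
        observation, done = step_env(action)
        if done:
            break
    return trace


def accuracy_by_cwe(records):
    """records: iterable of dicts with 'cwe_type' and a boolean 'correct'."""
    totals = defaultdict(lambda: [0, 0])  # cwe_type -> [correct, seen]
    for rec in records:
        bucket = totals[rec["cwe_type"]]
        bucket[0] += int(bool(rec["correct"]))
        bucket[1] += 1
    return {cwe: hits / seen for cwe, (hits, seen) in sorted(totals.items())}


def write_results(records, path="eval_results.json"):
    """Write overall and per-CWE accuracy to a JSON file and return the payload."""
    records = list(records)
    overall = (
        sum(bool(r["correct"]) for r in records) / len(records) if records else 0.0
    )
    payload = {
        "overall_accuracy": overall,
        "accuracy_by_cwe": accuracy_by_cwe(records),
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh, indent=2)
    return payload
```

Stopping the episode on a missing `<action>` block, instead of retrying, is one possible design choice; it keeps malformed outputs visible in the trace as `None` so they can be counted as failures.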

Task 1.2 Dataset Spot-Check (30 min)

Goal: Verify the quality of Deepak's 5000-sample dataset in the mvd branch.

  • Manually review 20 random samples from data/devign_filtered.jsonl.
  • Check that each sample's cwe_type label matches what its diff actually shows, and that the sample is a reasonable task for a 3B model.
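
To make the spot-check reproducible, the 20 samples could be drawn with a fixed seed, as in the sketch below. It assumes that each line of the JSONL file is one JSON object and that `diff` and `cwe_type` are top-level keys; the function names are hypothetical. The automated check only flags obviously broken records, and the consistency judgement itself stays manual.

```python
import json
import random


def sample_jsonl(path, k=20, seed=0):
    """Return up to k randomly chosen (line_number, record) pairs from a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        rows = [
            (line_no, json.loads(line))
            for line_no, line in enumerate(fh, start=1)
            if line.strip()  # skip blank lines
        ]
    rng = random.Random(seed)  # fixed seed so reviewers see the same samples
    return rng.sample(rows, min(k, len(rows)))


def basic_issues(record):
    """Flag records that are clearly unusable before any manual review."""
    issues = []
    diff = record.get("diff")
    if diff is None or not str(diff).strip():
        issues.append("empty diff")
    cwe = record.get("cwe_type")
    if cwe is None or not str(cwe).strip():
        issues.append("missing cwe_type")
    return issues
```

Calling `sample_jsonl("data/devign_filtered.jsonl")` returns the pairs to review; printing each record's line number alongside `basic_issues(record)` gives a short checklist to work through by hand.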

Phase 3 Demo + Storytelling (Sunday Morning)

Task 3.1 Baseline vs. Trained Evaluation (P0)

Goal: Produce the data that Niti needs for the final plots.

  • Run the hardened evaluate.py against 100 held-out samples using the untrained (baseline) model.
  • Run it again once Niti provides the Trained LoRA adapter.
  • Capture the delta: "Detection accuracy: X% -> Y%".
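
The headline string and a per-CWE comparison can be produced from the two runs with a few lines of Python. This is a sketch under assumptions: accuracies are passed in as fractions between 0 and 1, the per-CWE inputs are plain dicts mapping a CWE label to accuracy, and both function names are hypothetical.

```python
def format_delta(baseline_acc, trained_acc):
    """Render two accuracies (fractions in [0, 1]) as the headline string."""
    change = (trained_acc - baseline_acc) * 100  # percentage points
    return (
        f"Detection accuracy: {baseline_acc:.1%} -> {trained_acc:.1%} "
        f"({change:+.1f} pts)"
    )


def per_cwe_delta(baseline, trained):
    """Percentage-point change for every CWE label present in both runs."""
    shared = sorted(baseline.keys() & trained.keys())
    return {cwe: round((trained[cwe] - baseline[cwe]) * 100, 1) for cwe in shared}
```

With 100 held-out samples, each sample is worth one percentage point, so small deltas are better reported alongside the raw counts.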

Task 3.2 Demo Video Recording (P0)

Goal: Create the visceral "Emotional Hook" for the judges.

  • Pick one "Hero Case" (e.g., a clear SQL Injection).
  • Record a 90-second side-by-side: the untrained model fumbling vs. the trained model reasoning through and identifying the vulnerability.
  • Upload to YouTube as Unlisted and provide the link for the README.

Task 3.3 HF Hub Blog Post (P1)

Goal: Hit the rubric requirement for community outreach.

  • Write a post on the HF Hub explaining the project, the RLVR reward design, and the results.
  • Embed the demo video and the reward plots generated by Niti.

Sync Points

  • Midnight Saturday: Confirm evaluate.py can handle multi-step XML traces.
  • 9:00 AM Sunday: Report final accuracy numbers (Baseline vs. Trained).
  • 3:00 PM Sunday: Final link check for Video and Blog post.