# CommitGuard Product Requirements Document
**Project:** CommitGuard
**Owner:** Niti (Inmodel Labs)
**Team:** Niti, Deepak, Divyank
**Submission deadline:** Sunday 5:00 PM IST
**Hackathon:** Meta OpenEnv Hackathon (PyTorch + Hugging Face + Scaler)
**Document status:** Locked. Scope freeze at midnight Saturday.
---
## 1. Executive Summary
CommitGuard is a Reinforcement Learning environment built on Meta OpenEnv that trains LLM agents to detect exploitable vulnerabilities in code commits. The submission demonstrates that AI-paced security review is feasible: an agent trained on commit-level reasoning can match the velocity at which AI coding agents are now shipping production code.
The deliverable is a runnable HF Space hosting the env, a training notebook that produces a measurable learning curve on Llama-3.2-3B-Instruct, a demo video showing the qualitative shift from untrained to trained behavior, and a README that tells the story.
---
## 2. Problem Statement
### 2.1 The shift in software development
Until recently, code was written by humans at human velocity. Security review processes were designed around this assumption: periodic pentests every 3 to 6 months, with manual code review at PR time. The cycle worked because the codebase changed slowly enough that periodic deep review caught most issues before they reached production.
This assumption has broken. Code is now being written and shipped by AI coding agents (Claude Code, Cursor, autonomous coding agents) at 10 to 100 times human velocity. Companies push to production daily, sometimes hourly. A pentest report from six months ago describes a codebase that no longer exists.
### 2.2 The asymmetry
The same class of LLM that writes the code can be weaponized to attack it. An adversary equipped with autonomous coding tooling, given repository access or even just leaked commits, can pentest at the same velocity defenders ship. Defense runs on human time. Offense runs on AI time. **This asymmetry is unsustainable for any organization shipping AI-generated code at scale.**
### 2.3 Why this is a frontier problem
AI red-teaming today is overwhelmingly a manual, human-bottlenecked discipline. Researchers at Anthropic, OpenAI, and Meta craft attacks one at a time. There is no automated equivalent of Metasploit for AI-generated code. Closing that gap is an open research problem that frontier labs are actively investing in.
---
## 3. Goals and Non-Goals
### 3.1 Goals (in scope for this submission)
- Deliver a working OpenEnv environment that takes a code commit as input and rewards an agent for correctly identifying vulnerabilities, the CWE class, and a plausible exploit
- Train a small Llama variant (Llama-3.2-3B-Instruct) on the env using GRPO via TRL + Unsloth
- Demonstrate measurable learning (baseline vs. trained accuracy) with reward curves
- Ship a complete submission package: HF Space, training notebook, README, demo video, optional HF blog post
- Frame the work in language a Meta researcher recognizes: RLVR (Reinforcement Learning from Verifiable Rewards), commit-time security, AI-paced defense
### 3.2 Non-goals (explicitly out of scope)
- Production-ready security tool: this is a research environment, not a CI plugin
- Real-time exploit execution against arbitrary code: the v1 reward uses pattern matching, not sandboxed execution
- Multi-file / repo-level reasoning: v1 operates on single-file commits up to 80 lines
- Multi-agent self-play: listed in Future Work
- Pentesting beyond static code analysis: no network attacks, social engineering, or runtime probing
- Coverage of all CWEs: v1 focuses on the top 10 CWEs in Devign
### 3.3 Non-goals from the rubric perspective
The rubric rewards ambition and storytelling more heavily than engineering polish. Therefore we are not pursuing exhaustive test coverage, not optimizing for inference latency, and not building a fancy frontend. The HF Space's default web UI is sufficient.
---
## 4. Target Users and Stakeholders
| Stakeholder | Role | What they care about |
|---|---|---|
| Hackathon judges (Meta partner engineers) | Primary audience | Innovation, story, training evidence, reward design |
| Meta Superintelligence Labs researchers | Aspirational audience | Frontier framing, RLVR alignment, paper-worthiness |
| HF community | Discovery audience | Reproducibility, runnable Space, clean README |
| Future contributors | Builder audience | Code clarity, extensibility hooks for v2 |
---
## 5. Solution Overview
### 5.1 The environment
CommitGuard is an OpenEnv environment where an agent investigates code commits and decides whether they introduce exploitable vulnerabilities. The agent has limited investigation budget (5 steps maximum per episode), forcing it to reason efficiently rather than brute-forcing context.
### 5.2 The agent loop
1. `reset()`: env loads a commit (a `code_before`/`code_after` pair plus metadata) from a preprocessed Devign-derived dataset, returns the diff and the list of available files in the repo
2. `step(action)`: agent emits one of three action types:
   - `request_context(file_path)`: pull surrounding code (small reward penalty, encourages efficiency)
   - `analyze(reasoning)`: write chain-of-thought; no reward effect, logged for traces
   - `verdict(is_vulnerable, vuln_type, exploit_sketch)`: terminate the episode with a judgment
3. Reward fires on verdict, computed server-side against ground truth the agent never sees
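The loop above can be sketched from the client side. Everything named here (`StubClient`, the verdict field layout) is illustrative: the real client is the OpenEnv-generated `HTTPEnvClient` subclass, and the exact tag grammar lives in `agent_prompt.py`.

```python
class StubClient:
    """Local stand-in so the example runs offline; the real client is the
    HTTPEnvClient subclass generated by the OpenEnv CLI."""

    def __init__(self):
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"diff": '+ query = "SELECT * FROM users WHERE id=" + user_id',
                "files": ["app/db.py"]}

    def step(self, action: str):
        self.steps += 1
        done = "<verdict>" in action or self.steps >= 5   # verdict or 5-step cap
        reward = -0.05 if "<request_context>" in action else 0.0
        return {"reward": reward, "done": done}


client = StubClient()
obs = client.reset()

# 1. Pull surrounding code (small penalty encourages efficiency)
r1 = client.step("<request_context>app/db.py</request_context>")

# 2. Free-form reasoning: no reward effect, logged for traces
r2 = client.step("<analyze>Raw string concatenation into SQL; likely CWE-89.</analyze>")

# 3. Verdict terminates the episode; the real reward fires server-side
r3 = client.step('<verdict>is_vulnerable=true cwe=CWE-89 '
                 'exploit=send user_id="1 OR 1=1" to dump every row</verdict>')

assert r1 == {"reward": -0.05, "done": False}
assert r3["done"] is True
```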
### 5.3 Reward design (RLVR philosophy)
The reward is tiered and grounded in dataset truth, not in another LLM's opinion. This is deliberate: it follows the RLVR tradition (verifiable rewards from ground truth or executable checks) and prevents the reward hacking that plagues LLM-as-judge setups.
| Signal | Reward |
|---|---|
| Correct binary verdict (vulnerable vs. safe) | +1.0 |
| Correct CWE classification (when vulnerable) | +0.5 |
| Plausible exploit sketch (CWE-keyword match) | +0.5 |
| False positive (safe flagged as vulnerable) | -1.0 |
| False negative (real vuln missed) | -0.5 |
| Per-step context request | -0.05 |
| Episode step cap | 5 steps |
The shape is hard to game: flagging everything is punished by false positives, and never investigating forfeits the exploit-sketch bonus.
---
## 6. Technical Architecture
### 6.1 System diagram
```
 TRL + Unsloth                   HTTP/JSON                HF Space
 Llama-3.2-3B      <---- reset / step / state ---->   FastAPI server (Docker)
 GRPO trainer                                            |-- Devign JSONL
 (HF Jobs A10G)                                          |-- Reward function
```
### 6.2 Component breakdown
**Env server** (Python, FastAPI, Docker, OpenEnv 0.2.3+)
- `models.py`: Action, Observation, State dataclasses (extends OpenEnv base classes)
- `environment.py`: `reset()`, `step()`, `state()` methods on the `CommitGuardEnvironment` class
- `reward.py`: pure function `compute_reward(action, ground_truth, cwe_keywords) -> float`
- `parse_action.py`: XML-tag parser, robust to malformed model output
- `data/devign_filtered.jsonl`: preprocessed dataset, shipped in image
- `data/cwe_keywords.json`: top-10 CWE exploit-pattern keyword map
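The parser can stay tiny. A sketch of the `parse_action.py` approach, checking tags in a fixed priority order and returning `None` on malformed output so the env penalizes instead of crashing (the real parser additionally splits verdict fields, elided here):

```python
import re

# Recognized action tags, in priority order
ACTION_TAGS = ("request_context", "analyze", "verdict")

def parse_action(text: str):
    """Extract the first recognized <tag>body</tag> pair from free-form
    model output. Returns (tag, body), or None when nothing parses."""
    for tag in ACTION_TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        if match:
            return tag, match.group(1).strip()
    return None  # malformed: caller applies the -0.5 penalty
```

Free-text XML tags tolerate the chatter small models wrap around their actions, which is why this was chosen over strict JSON mode.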
**Env client** (auto-generated by OpenEnv CLI)
- `client.py`: `HTTPEnvClient` subclass, used by training notebook
- Installable via `pip install git+https://huggingface.co/spaces/<user>/commitguard`
**Training pipeline** (Python, TRL, Unsloth, PEFT, Wandb)
- `train_grpo.py`: GRPOTrainer config + main loop
- `agent_prompt.py`: system prompt template with XML-tag action format
- `evaluate.py`: runs N samples through a model, returns accuracy stats
**Storytelling artifacts**
- `README.md`: pitch + results + links
- `demo_video.mp4`: 60-90 second before/after, hosted on YouTube unlisted
- `commitguard_hf_blog.md`: optional HF Hub blog post (page 26 bonus)
- `plots/`: reward_curve.png, baseline_vs_trained.png, per_cwe.png
### 6.3 Data flow
1. Preprocess Devign once at build time into `data/devign_filtered.jsonl` (~5000 samples, balanced, filtered to <80 LOC)
2. Build Docker image with JSONL embedded
3. `openenv push` deploys to HF Space
4. Training notebook connects to HF Space URL via the OpenEnv HTTP client
5. Each training step: GRPO generates 4 completions per prompt; each completion runs a full episode in the env; rewards are collected and the policy is updated via LoRA
6. Wandb logs reward curves, training loss, checkpoints saved every 50 steps
7. Final LoRA adapter saved to HF Hub for evaluation and demo
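Step 1 can be sketched in a few lines of stdlib Python. `func` and `target` are the Devign field names and should be confirmed against the local dump:

```python
import json
import random

def preprocess(records: list[dict], max_loc: int = 80, seed: int = 0) -> list[dict]:
    """Filter Devign-style records to short samples and balance the
    vulnerable (target=1) and safe (target=0) classes by downsampling."""
    short = [r for r in records if len(r["func"].splitlines()) < max_loc]
    vuln = [r for r in short if r["target"] == 1]
    safe = [r for r in short if r["target"] == 0]
    rng = random.Random(seed)              # fixed seed for reproducibility
    rng.shuffle(vuln)
    rng.shuffle(safe)
    n = min(len(vuln), len(safe))
    return vuln[:n] + safe[:n]

def write_jsonl(samples: list[dict], path: str) -> None:
    """Emit one JSON object per line for the Docker build step."""
    with open(path, "w") as f:
        for s in samples:
            f.write(json.dumps(s) + "\n")
```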
### 6.4 Cheating prevention
The agent must never see ground truth. Enforced by architecture:
- Ground truth lives only on the server, in the JSONL file the env loads from
- The Observation dataclass schema explicitly excludes `is_vulnerable`, `cwe_type`, and `target_file_with_label`
- A unit test (`test_no_leak.py`) asserts no observation contains forbidden fields
- The server returns only `reward` (a scalar) on each step, never the label that produced it
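The core of the no-leak check can be a recursive field scan. This sketch assumes observations serialize to plain dicts; the real `test_no_leak.py` operates on the Observation dataclass:

```python
# Fields that must never leave the server
FORBIDDEN = {"is_vulnerable", "cwe_type", "target_file_with_label"}

def assert_no_leak(observation: dict) -> None:
    """Fail if any ground-truth field appears anywhere in an observation,
    including nested dicts."""
    leaked = FORBIDDEN & observation.keys()
    assert not leaked, f"ground truth leaked into observation: {leaked}"
    for value in observation.values():
        if isinstance(value, dict):
            assert_no_leak(value)
```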
---
## 7. Stack and Dependencies
### 7.1 Locked technical decisions
| Decision | Choice | Rationale |
|---|---|---|
| Env framework | Meta OpenEnv 0.2.3+ | Mandatory per submission rules |
| Server runtime | FastAPI in Docker | OpenEnv default, lowest friction |
| Hosting | HF Space | Mandatory per submission rules, three-in-one (server + repo + registry) |
| Data source | Devign (DetectBERT subset) | Already on disk, real CWE labels, manageable size |
| Model | Llama-3.2-3B-Instruct | Meta-branded for the Meta hackathon, fits A10G with GRPO |
| Training framework | TRL with GRPO | Native OpenEnv integration via `reward_funcs` callback |
| Training optimization | Unsloth 4-bit + LoRA r=8 | 70% memory reduction, 2x speed (page 75 of opening deck) |
| Training infra | HF Jobs A10G | $0.40-1.50/hr, runs unattended, integrates with HF ecosystem |
| Dev infra | GCP VM with T4 | Stable, no Colab disconnects, leverages 24,000 GCP credit |
| Action serialization | XML-tag free-text | Robust to small-model output variance, easier than JSON-mode |
| Logging | Wandb | TRL native, judges can view runs |
### 7.2 Fallback decisions (pre-approved, no debate when triggered)
| If this fails | Fall back to | Trigger |
|---|---|---|
| Llama-3.2-3B OOM on A10G | Qwen2.5-1.5B-Instruct | First test step crashes |
| HF Jobs queue full | GCP A10G on-demand | Job queues for >30 min |
| 3-action env doesn't ship by midnight | 2-action env (analyze + verdict) | Niti's checkpoint red |
| Tiered reward buggy | Binary correct/incorrect reward | Deepak's checkpoint red |
| Training curve flat | Ship with qualitative comparison only | Curve still flat at 10 AM Sunday |
| Demo video can't be cleanly recorded | Side-by-side text trace in README | Recording fails twice |
---
## 8. Functional Requirements
### 8.1 Environment functional requirements
| ID | Requirement | Priority |
|---|---|---|
| F-1 | Env exposes `/health`, `/reset`, `/step`, `/state`, `/docs` endpoints | P0 |
| F-2 | `reset()` returns a random commit observation, never the same one twice in a single episode | P0 |
| F-3 | `step()` accepts XML-tagged action strings and parses them robustly | P0 |
| F-4 | `step()` returns reward, observation, and done flag | P0 |
| F-5 | Episode terminates on `verdict` action OR after 5 steps | P0 |
| F-6 | Observation never contains ground-truth labels | P0 |
| F-7 | Env handles malformed actions gracefully (returns -0.5 reward, doesn't crash) | P1 |
| F-8 | Env supports concurrent episodes (multiple training generations in parallel) | P1 |
| F-9 | Web UI on HF Space allows manual interaction for demo recording | P2 |
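F-5 and F-7 together pin down the `step()` dispatch. A minimal sketch, assuming actions arrive pre-parsed as `(tag, body)` tuples or `None` when malformed; a verdict's real reward comes from `reward.py` and is shown here as 0.0:

```python
from dataclasses import dataclass

MAX_STEPS = 5            # F-5: episode step cap
MALFORMED_PENALTY = -0.5 # F-7: penalize, never crash
CONTEXT_PENALTY = -0.05  # per-step context request

@dataclass
class StepResult:
    reward: float
    done: bool

def step(parsed_action, step_count: int) -> StepResult:
    """Dispatch sketch; the real implementation lives in environment.py."""
    if parsed_action is None:                           # F-7: malformed action
        return StepResult(MALFORMED_PENALTY, step_count >= MAX_STEPS)
    tag, _body = parsed_action
    if tag == "verdict":                                # F-5: verdict terminates
        return StepResult(0.0, True)                    # real reward elided
    reward = CONTEXT_PENALTY if tag == "request_context" else 0.0
    return StepResult(reward, step_count >= MAX_STEPS)  # F-5: cap at 5 steps
```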
### 8.2 Training functional requirements
| ID | Requirement | Priority |
|---|---|---|
| T-1 | Training notebook runs end-to-end on a single A10G | P0 |
| T-2 | Reward curve, training loss, and completions logged to Wandb | P0 |
| T-3 | LoRA adapter saved every 50 steps for resumability | P0 |
| T-4 | Baseline (untrained) evaluation on 100 held-out samples completes in <10 min | P0 |
| T-5 | Trained model evaluation produces per-CWE accuracy breakdown | P1 |
| T-6 | Notebook runnable from Colab via "Open in Colab" badge in README | P1 |
### 8.3 Storytelling functional requirements
| ID | Requirement | Priority |
|---|---|---|
| S-1 | README explains problem, env, results, and motivation in <5 min read | P0 |
| S-2 | All plot PNGs committed to repo (not Wandb-only) | P0 |
| S-3 | Demo video 60-90 sec, before/after on a single SQL injection example | P0 |
| S-4 | Wandb run URL linked in README | P1 |
| S-5 | HF Hub blog post published and linked | P2 |
---
## 9. Non-Functional Requirements
| Aspect | Requirement |
|---|---|
| Performance | Single `step()` call returns in <2 seconds on HF Space free tier |
| Reliability | Env survives 100 random episodes without crash |
| Reproducibility | Training notebook produces a measurable learning curve when re-run with same seed |
| Discoverability | HF Space tagged with `openenv`, `rl`, `security`, `code` |
| Documentation | README is self-contained; a judge can understand it without reading source |
| Licensing | Code MIT-licensed, dataset attribution to Devign authors |
---
## 10. Success Metrics
### 10.1 Submission completeness (binary, must-pass)
- [ ] HF Space deployed and `/health` returns 200 OK
- [ ] Training notebook runs without crashes on a fresh Colab/VM
- [ ] README has all required links (HF Space, notebook, video, GitHub)
- [ ] At least one reward curve plot committed
- [ ] Demo video accessible via public URL
### 10.2 Quality metrics (graded by rubric)
| Metric | Target | Stretch |
|---|---|---|
| Innovation framing recognized by mentor | "this is an interesting angle" feedback | "this is paper-worthy" feedback |
| Baseline accuracy (untrained Llama-3.2-3B) | Establishes a floor (likely 30-45%) | |
| Trained accuracy (after 300 GRPO steps) | Beats baseline by 10pp absolute | Beats baseline by 20pp |
| Reward curve | Bends upward visibly | Smooth monotonic increase |
| Per-CWE breakdown | At least 3 CWEs show improvement | All top-5 CWEs show improvement |
| Storytelling | Mentor at Round 3 can repeat the pitch back | Mentor offers to share with Meta team |
### 10.3 Anti-metrics (things we explicitly don't optimize for)
- Number of features
- Number of CWEs covered (more is not better; depth beats breadth here)
- Lines of code
- Model size (going larger doesn't make a stronger submission, just slower training)
---
## 11. Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Training run produces flat curve | Medium | High | Pre-approved pivot to qualitative-comparison narrative; baseline already establishes a contrast |
| HF Space deployment fails at 4 AM | Low | High | Fallback to Docker image with `docker run` instructions in README |
| Llama-3.2 license approval delayed | Low | Medium | Submit license request immediately at GCP setup; Qwen-1.5B fallback ready |
| Devign data has bad CWE labels | Medium | Medium | Filter aggressively; if too noisy, drop to top-5 cleanest CWEs only |
| One teammate falls behind their phase | Medium | High | Sync points at midnight, 9 AM, 3 PM allow scope cuts; mock-env pattern means training isn't blocked |
| Niti exhausted at Mentor Round 3 | High if no sleep | High | Mandatory sleep schedule 12:30 AM to 5:00 AM, non-negotiable |
| Demo video can't be cleanly recorded | Medium | Medium | Cherry-pick the best example; fall back to text trace if recording fails twice |
| HF Space rate limits during training | Low | Medium | Run training on local Docker if HF Space hits limits |
---
## 12. Timeline and Milestones
| Time (IST) | Milestone | Owner |
|---|---|---|
| Sat 8:00 PM | Mentor Round 2: pitch validation | Niti |
| Sat 9:30 PM | Phase 1 starts: env scaffolding, data prep, training scaffolding in parallel | All |
| Sat 11:59 PM | Phase 1 checkpoint: env runs, data ready, mock training works | All |
| Sun 12:00 AM | **Scope freeze**: no new features after this point | All |
| Sun 12:30 AM | Niti sleep starts | Niti |
| Sun 3:00 AM | HF Space live, Deepak sleep starts | Deepak |
| Sun 5:00 AM | Niti wakes, watches training | Niti |
| Sun 5:30 AM | Real training run launched on HF Jobs, Divyank sleep starts | Divyank |
| Sun 9:00 AM | Team sync: training results, plot status | All |
| Sun 10:00 AM | Mentor Round 3: final sharpening | Niti |
| Sun 11:30 AM | Demo video recorded and uploaded | Divyank |
| Sun 1:00 PM | README finalized | Niti |
| Sun 3:00 PM | **Feature freeze**: 2-hour reminder before deadline, no more changes | All |
| Sun 4:30 PM | Submission packaged | Niti |
| Sun 5:00 PM | **Submission deadline** | |
---
## 13. Open Questions and Assumptions
### 13.1 Assumptions
- Devign dataset is on disk locally (or downloadable in <30 min); to be verified by Deepak at Phase 1 start
- HF Space free tier is sufficient for env hosting during the hackathon; backup plan: $9/mo upgrade if rate limited
- Llama-3.2-3B-Instruct license approval lands within 1 hour of request; Qwen fallback ready if not
- HF Jobs A10G availability at 5 AM Sunday; GCP A10G fallback if queued
### 13.2 Open questions (to resolve during execution)
- Exact number of training steps to maximize curve visibility within budget; answered empirically by 9 AM Sunday based on observed loss
- Whether to ship a Colab-runnable notebook AND an HF Jobs notebook, or just one; defer to Divyank's call at Phase 2
- Whether to include a comparison against a non-RL baseline (pure SFT or zero-shot); stretch only
---
## 14. Future Work (Post-Hackathon)
This section becomes part of the README's "What's Next" pitch; it explicitly signals to judges that we understand the limitations and have a roadmap.
- **Sandboxed exploit execution**: replace pattern-match reward with actual exploit runs against compiled code in a Docker sandbox
- **Multi-file commit reasoning**: extend the env to support diffs spanning multiple files, with a context budget
- **Self-play loop**: pair CommitGuard with a code-generation agent; defender and attacker train against each other (the AlphaGo pattern for security)
- **Agentic harness integration**: wire into real CI pipelines via the OpenEnv MCP layer, enabling commit-time security review at PR open
- **Real CVE corpus**: extend beyond Devign to recent CVE-tagged commits from major open-source repos
- **Multi-language support**: current env is C-focused via Devign; extend to Python, JavaScript, Go
- **Reward shape ablations**: formal study of how reward composition affects which vulnerability types the model learns fastest
---
## 15. Appendix
### 15.1 Key reference URLs (for the team to bookmark)
- OpenEnv repo: https://github.com/meta-pytorch/OpenEnv
- OpenEnv Scaler intro: https://tinyurl.com/openenv-scaler
- TRL OpenEnv docs: https://huggingface.co/docs/trl/en/openenv
- TRL Sudoku GRPO example: https://github.com/huggingface/trl/blob/main/examples/notebooks/openenv_sudoku_grpo.ipynb
- TRL Wordle GRPO example: https://github.com/huggingface/trl/blob/main/examples/notebooks/openenv_wordle_grpo.ipynb
- Unsloth 2048 example: https://github.com/meta-pytorch/OpenEnv/blob/main/tutorial/examples/unsloth_2048.ipynb
- Llama-3.2-3B model card: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- HF Jobs docs: https://huggingface.co/docs/hub/jobs
- Cursor credits: https://tinyurl.com/sclr-openenv-dashboard
- HF $30 credits: https://huggingface.co/coupons/claim/hf-openenv-community
### 15.2 Document version
- v1.0: Saturday evening, Bangalore venue. Locked at midnight Saturday.
- Changes after lock require explicit team-wide sign-off and a documented rationale.
---
## 16. The 30-Second Pitch (For Mentor Rounds, Memorize This)
> "AI is now writing production code at AI speed. Security review still runs on a 6-month human cycle. The same LLMs that write the code can attack it: defense is on human time, offense is on AI time, and that asymmetry breaks the security model.
>
> CommitGuard is an OpenEnv where an agent learns to flag exploitable diffs at commit time. We trained Llama-3.2-3B on it via GRPO and the detection rate climbs measurably. It's RLVR: verifiable rewards from ground truth, not LLM judges. The thesis: continuous AI red-teaming at the velocity code is being shipped. This is the environment to train it."