shank committed on
Commit · eacdf84
Parent(s): 3165754
Added blog post
Blog.md ADDED
File without changes

README.md CHANGED

@@ -25,7 +25,7 @@ LLMs often hallucinate bug fixes via blind trial-and-error. Real debugging in pr
 25  * **Hypothesis-Driven Debugging (NeurIPS 2025):** Recent research presented at NeurIPS demonstrates that forcing an LLM to formulate a concrete hypothesis before generating code significantly improves debugging accuracy. Inspired by this, our environment mandates a strict `OBSERVATION` → `HYPOTHESIS` → `ACTION` loop. Every single step taken by the agent must be preceded by a formal hypothesis to receive a positive reward.
 26  * **Literature-Backed Reward Criteria:** Our continuous, multi-objective reward-shaping architecture is heavily influenced by the latest findings on LLM reasoning and code generation, specifically drawing from:
 27    * [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)
-28    * [arXiv:2601.19100](https://arxiv.org/abs/2601.19100)
+28    * [arXiv:2601.19100](https://arxiv.org/abs/2601.19100)
 29  * **Curriculum Learning for RL:** A flat bug distribution caused early policy collapse. We implemented a 3-tier curriculum, introducing complex logic bugs only after structural formatting and syntax localization stabilized.
 30  * **Hardened Sandboxed Grading:** Evaluating arbitrary LLM-generated fixes introduces severe RCE risks. We engineered a secure execution sandbox that restricts execution time, limits memory, and completely replaces unsafe `exec()` calls, ensuring deterministic and safe grading.
 31
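
The `OBSERVATION` → `HYPOTHESIS` → `ACTION` gate and the multi-objective shaping described in the hunk above can be pictured as a format check feeding one term of a weighted reward sum. A minimal sketch follows; the tag names, regex, reward terms, and weights are illustrative assumptions, not the repository's actual code.

```python
import re

# Illustrative step format; the environment's real tags may differ.
STEP_PATTERN = re.compile(
    r"OBSERVATION:\s*(?P<obs>.+?)\s*"
    r"HYPOTHESIS:\s*(?P<hyp>.+?)\s*"
    r"ACTION:\s*(?P<act>.+)",
    re.DOTALL,
)

def format_reward(step_text: str) -> float:
    """Positive shaping signal only when a non-trivial hypothesis
    precedes the action; otherwise penalize the step."""
    match = STEP_PATTERN.search(step_text)
    if match is None or len(match.group("hyp").strip()) < 10:
        return -1.0  # assumed penalty for skipping or padding the loop
    return 1.0

def total_reward(step_text: str, tests_passed: float, steps_used: int) -> float:
    """Continuous multi-objective reward: weighted format gate, test
    pass rate in [0, 1], and a small step-efficiency penalty (weights assumed)."""
    return 0.2 * format_reward(step_text) + 0.7 * tests_passed - 0.01 * steps_used
```

With these assumed weights, a malformed step forfeits a 0.4 reward swing relative to a well-formed one, independent of test outcomes.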
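
The 3-tier curriculum mentioned in the same hunk is commonly realized as a success-rate gate: harder bug classes unlock only once the current tier is stable. Here is one possible shape, with the tier contents, window size, and promotion threshold all assumed for illustration.

```python
import random
from collections import deque

# Assumed tier contents; the actual bug taxonomy lives in the environment.
TIERS = [
    ["formatting", "syntax_localization"],  # tier 0: structural bugs
    ["off_by_one", "wrong_operator"],       # tier 1: shallow logic bugs
    ["state_corruption", "api_misuse"],     # tier 2: complex logic bugs
]

class Curriculum:
    def __init__(self, promote_at: float = 0.8, window: int = 50):
        self.tier = 0
        self.promote_at = promote_at
        self.recent = deque(maxlen=window)  # rolling record of solved episodes

    def sample_bug(self) -> str:
        # Sample uniformly from all tiers unlocked so far.
        unlocked = [bug for tier in TIERS[: self.tier + 1] for bug in tier]
        return random.choice(unlocked)

    def record(self, solved: bool) -> None:
        self.recent.append(solved)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and self.tier < len(TIERS) - 1:
            if sum(self.recent) / len(self.recent) >= self.promote_at:
                self.tier += 1
                self.recent.clear()  # avoid double promotion on stale stats
```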
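
For the hardened grading bullet, a common replacement for in-process `exec()` is a resource-limited child process. The sketch below is Linux-specific and purely illustrative: the limits, flags, and helper names are assumptions, not the project's actual sandbox.

```python
import resource
import subprocess
import sys

def run_candidate(path: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Execute an untrusted candidate fix in a resource-limited child
    process instead of calling exec() inside the grader's interpreter."""
    def limit_resources() -> None:
        # Runs in the child just before exec (POSIX only).
        resource.setrlimit(resource.RLIMIT_CPU, (5, 5))             # CPU seconds
        resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20,) * 2)  # 512 MiB memory

    return subprocess.run(
        [sys.executable, "-I", path],  # -I: isolated mode, ignores env and site
        capture_output=True,
        text=True,
        timeout=timeout_s,             # wall-clock kill switch
        preexec_fn=limit_resources,
    )
```

A caller would catch `subprocess.TimeoutExpired` and map it to a failed grade, keeping the evaluation deterministic even for runaway fixes.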
@@ -50,7 +50,7 @@ LLMs often hallucinate bug fixes via blind trial-and-error. Real debugging in pr
 50  Our training run clearly demonstrates rapid policy adaptation. The model learned the `OBSERVATION/HYPOTHESIS/ACTION` constraint almost instantly and navigated the tier-2 difficulty bump (step 150) with a textbook drop-and-recover curve.
 51
 52  ## Training Results
-53  [W&B Run](https://wandb.ai/shashaankjain07-keshav-memorial-college-of-law/AgentDebuggerEnv/runs/vylbqd5m?nw=nwusershashaankjain07) | [
+53  [W&B Run](https://wandb.ai/shashaankjain07-keshav-memorial-college-of-law/AgentDebuggerEnv/runs/vylbqd5m?nw=nwusershashaankjain07) | [HF Blog](#)
 54
 55  *(Note for Hackathon Judges: live Weights & Biases charts and the Gradio UI are embedded below as evidence of the training run.)*
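
To connect reward functions like the ones sketched above to the GRPO pipeline referenced in the next hunk, here is a minimal, hedged wiring example using TRL's `GRPOTrainer`. The model name, toy dataset, and reward function are placeholders rather than the team's actual configuration; their Jupyter notebook is the authoritative source.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt standing in for the environment's debugging episodes.
dataset = Dataset.from_dict(
    {"prompt": ["Fix the bug in: def add(a, b): return a - b"]}
)

def hypothesis_first(completions, **kwargs):
    """Reward completions that state a HYPOTHESIS before an ACTION."""
    rewards = []
    for text in completions:
        h, a = text.find("HYPOTHESIS"), text.find("ACTION")
        rewards.append(1.0 if 0 <= h < a else -1.0)
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # placeholder model choice
    reward_funcs=hypothesis_first,
    args=GRPOConfig(output_dir="grpo-debugger-sketch", num_generations=4),
    train_dataset=dataset,
)
trainer.train()
```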
@@ -104,5 +104,5 @@ The easiest way to re-run the exact GRPO training pipeline is via our Jupyter No
 104 ---
 105
 106 ### 🔥 Team Endurance
-107 * **Shashaank Jain** | GitHub: [@shasshaank](https://github.com/shasshaank) | Email: *[
+107 * **Shashaank Jain** | GitHub: [@shasshaank](https://github.com/shasshaank) | Email: *[shashaankjain07@gmail.com]*
 108 * **Pranav Pulipati** | GitHub: [@PulipatiPranav](https://github.com/PulipatiPranav) | Email: *[pranavpulipatix@gmail.com]*