---
title: AgentDebuggerEnv
emoji: 🐛
colorFrom: purple
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---

# AgentDebuggerEnv

**Hackathon Links:**

- **[Live Hugging Face Space](https://huggingface.co/spaces/shashaank0707/AgentDebugger-training-v3)**
- **[Read the Technical Writeup](https://huggingface.co/spaces/shashaank0707/AgentDebugger-training-v3/blob/main/Blog.md)**

### One-Line Pitch

An OpenEnv-backed reinforcement learning environment that trains LLMs to debug code systematically via Group Relative Policy Optimization (GRPO) and secure sandbox execution.

### Why This Exists

LLMs often hallucinate bug fixes through blind trial-and-error. Real debugging in production requires hypothesis-driven reasoning, isolation, and verification. We engineered an environment that forces models to observe, hypothesize, and execute code within a secure sandbox, penalizing blind guessing and explicitly rewarding structured problem-solving.

### Key Technical Insights & Research Foundations

* **Hypothesis-Driven Debugging (NeurIPS 2025):** Recent research presented at NeurIPS demonstrates that forcing an LLM to formulate a concrete hypothesis before generating code significantly improves debugging accuracy. Inspired by this, our environment mandates a strict `OBSERVATION` → `HYPOTHESIS` → `ACTION` loop: every step taken by the agent must be preceded by a formal hypothesis to receive a positive reward.
* **Literature-Backed Reward Criteria:** Our continuous, multi-objective reward-shaping architecture draws on recent findings in LLM reasoning and code generation, specifically:
  * [arXiv:2408.10215](https://arxiv.org/abs/2408.10215)
  * [arXiv:2601.19100](https://arxiv.org/abs/2601.19100)
* **Curriculum Learning for RL:** A flat bug distribution caused early policy collapse. We implemented a 3-tier curriculum, introducing complex logic bugs only after structural formatting and syntax localization stabilized.
* **Hardened Sandboxed Grading:** Evaluating arbitrary LLM-generated fixes introduces severe RCE risks. We engineered a secure execution sandbox that restricts execution time, limits memory, and completely replaces unsafe `exec()` calls, ensuring deterministic and safe grading.

### Architecture Overview

* **OpenEnv Core:** Manages state transitions, agent interactions, and environment telemetry.
* **Grader Subsystem:** Multi-layered evaluation combining a Hard Grader (secure execution, deterministic AST matching) and a Soft Grader (Llama-3.1-70B semantic evaluation).
* **Trainer:** Hugging Face TRL GRPO pipeline with dynamic batch scaling based on runtime VRAM detection.
* **Live Monitor:** A Gradio dashboard streaming `stdout` and Weights & Biases metrics directly from the active training container.

### What Makes This Impressive

* **Zero-to-One in 250 Steps:** Achieved a ~2.5x increase in total reward within just 250 steps, demonstrating strong sample efficiency via GRPO.
* **Dynamic Hardware Scaling:** The training pipeline detects hardware capability (A100/H100 vs. T4) and automatically scales `batch_size`, `grad_accum`, and compute `dtype` (`bfloat16`/`float16`), eliminating OOM errors across deployment environments.
* **Frictionless Deployment:** Bypassed heavy dependency constraints (PyTorch/TRL vs. Gradio pip conflicts) by engineering a lazy-loading runtime environment that ensures deterministic Docker builds.

### Tech Stack

* **Frameworks:** OpenEnv, FastAPI, Docker
* **RL Pipeline:** Hugging Face TRL (GRPO), PEFT (LoRA)
* **Models:** Qwen2.5-Coder-7B-Instruct (base), Llama-3.1-70B (evaluator)
* **Telemetry:** Weights & Biases

### Results & Benchmarks

Our training run demonstrates rapid policy adaptation. The model learned the `OBSERVATION/HYPOTHESIS/ACTION` constraint almost immediately and navigated the tier-2 difficulty bump (step 150) with a textbook drop-and-recover curve.
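The `OBSERVATION` → `HYPOTHESIS` → `ACTION` constraint described above can be enforced with a simple shaping term. The sketch below is an illustrative reconstruction, not the environment's actual grader: the section names follow the protocol, but the regex, function name, and reward weights are assumptions.

```python
import re

# Require all three protocol sections, in order, anywhere in the step.
# (Pattern and weights are illustrative, not the project's real values.)
SECTION_PATTERN = re.compile(r"OBSERVATION:.*?HYPOTHESIS:.*?ACTION:", re.DOTALL)

def format_reward(step_text: str) -> float:
    """Grant a small shaping bonus only for well-formed debugging steps."""
    if SECTION_PATTERN.search(step_text):
        return 0.2   # bonus for stating a hypothesis before acting
    return -0.5      # penalize blind guessing (no hypothesis stated)
```

A reward like this is cheap to compute per step and composes with the hard/soft grader scores into the multi-objective total.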
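The hardened sandbox described earlier replaces in-process `exec()` with an isolated interpreter. Here is a minimal POSIX-only sketch of that idea, assuming a subprocess-based runner with CPU-time and memory caps; the limits, function names, and return shape are illustrative, not the project's actual grader code.

```python
import resource
import subprocess
import sys

def _limit_resources() -> None:
    # Runs in the child just before exec: cap CPU time and address space.
    # (POSIX only; limits here are illustrative.)
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
    resource.setrlimit(resource.RLIMIT_AS, (512 * 1024 * 1024,) * 2)

def run_candidate_fix(code: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Execute an LLM-generated fix in a separate interpreter, never exec()."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
            preexec_fn=_limit_resources,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stdout + proc.stderr
```

Because the child is a fresh interpreter in isolated mode, a malicious fix cannot touch the grader's own state, and the wall-clock timeout keeps grading deterministic.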
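The dynamic hardware scaling mentioned above can be expressed as a pure policy function. This sketch assumes the VRAM size and bfloat16 support are detected elsewhere (in practice via `torch.cuda.get_device_properties` and `torch.cuda.is_bf16_supported`); the thresholds and batch numbers are placeholders, not the pipeline's real values.

```python
def scale_training_config(vram_gb: float, bf16_supported: bool) -> dict:
    """Pick batch size, grad accumulation, and dtype from detected hardware.

    Inputs are plain arguments so the policy is testable without a GPU;
    all numeric thresholds below are illustrative.
    """
    if vram_gb >= 40:      # A100/H100-class card
        batch, accum = 8, 2
    elif vram_gb >= 16:    # T4/V100-class card
        batch, accum = 2, 8
    else:                  # small GPU or CPU: trade speed for safety
        batch, accum = 1, 16
    return {
        "per_device_train_batch_size": batch,
        "gradient_accumulation_steps": accum,
        "dtype": "bfloat16" if bf16_supported else "float16",
    }
```

Keeping the effective batch size (`batch * accum`) constant across tiers means the GRPO hyperparameters need no retuning when the Space is moved between GPU classes.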
## Training Results

[W&B Run](https://wandb.ai/shashaankjain07-keshav-memorial-college-of-law/AgentDebuggerEnv/runs/vylbqd5m?nw=nwusershashaankjain07) | [HF Blog](https://huggingface.co/spaces/shashaank0707/AgentDebugger-training-v3/blob/main/Blog.md)

*(Note for Hackathon Judges: Live Weights & Biases charts and the Gradio UI are embedded below as evidence of the training run.)*

*Additional Training Metrics:*