Bill1235813 committed 5 days ago

Commit 653567e (verified) · 1 parent: 0f6620f

Add / refresh organization card

Files changed (1)

README.md +47 -7

@@ -1,10 +1,50 @@
 ---
-title: README
-emoji: 🏆
-colorFrom: green
-colorTo: green
-sdk: static
-pinned: false
 ---
-Edit this `README.md` markdown file to author your organization card.

 ---
+license: mit
 ---
+# Precise Debugging Benchmarking (PDB)
+📄 Paper *(coming soon)* &nbsp;·&nbsp;
+💻 [Code](https://github.com/Bill1235813/PDB) &nbsp;·&nbsp;
+🌐 [Project page](https://precise-debugging-benchmark.github.io/) &nbsp;·&nbsp;
+🏆 [Leaderboard](https://precise-debugging-benchmark.github.io/leaderboard.html)
+**PDB** is an automatic pipeline that turns any coding dataset into a
+*debugging* benchmark with fine-grained metrics. Beyond binary unit-test
+scores, PDB evaluates a debugger with **edit-level precision** (did the model
+touch only the lines it had to?) and **bug-level recall** (did it fix every
+fault?). This rewards targeted fixes and penalizes the regeneration behavior
+frontier LLMs often fall back on.
+> Frontier models like GPT-5.1-Codex and DeepSeek-V3.2-Thinking top unit-test
+> leaderboards (>76%) but score at or below 45% on precision: they pass tests
+> by rewriting, not repairing. PDB makes that gap measurable.
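
To make the two metrics concrete, here is a minimal sketch of how edit-level precision and bug-level recall could be computed for single-line bugs. It is an illustration under stated assumptions, not the PDB implementation: it assumes the buggy, patched, and reference programs have the same number of lines, treats a bug as fixed only when the patched line exactly matches the reference line, and uses hypothetical function names (`edit_precision`, `bug_recall`). The actual pipeline may define and measure these differently.

```python
def _split(buggy, patched, reference):
    """Split the three programs into lines; this sketch assumes equal line counts."""
    rows = [s.splitlines() for s in (buggy, patched, reference)]
    if len({len(r) for r in rows}) != 1:
        raise ValueError("this sketch assumes all three programs have the same line count")
    return rows


def edit_precision(buggy, patched, reference):
    """Fraction of edited lines that were actually faulty (assumed definition)."""
    b, p, r = _split(buggy, patched, reference)
    edited = {i for i, (x, y) in enumerate(zip(b, p)) if x != y}
    faulty = {i for i, (x, y) in enumerate(zip(b, r)) if x != y}
    if not edited:
        return 0.0  # convention chosen for this sketch: no edits means no credit
    return len(edited & faulty) / len(edited)


def bug_recall(buggy, patched, reference):
    """Fraction of faulty lines whose patched version matches the reference (assumed definition)."""
    b, p, r = _split(buggy, patched, reference)
    faulty = {i for i, (x, y) in enumerate(zip(b, r)) if x != y}
    if not faulty:
        return 1.0  # nothing to fix
    fixed = {i for i in faulty if p[i] == r[i]}
    return len(fixed) / len(faulty)


# Toy example: one injected bug on the second line.
BUGGY = "\n".join(["def add(a, b):", "    total = a - b", "    return total"])
REFERENCE = "\n".join(["def add(a, b):", "    total = a + b", "    return total"])
# A targeted repair touches only the faulty line.
TARGETED = REFERENCE
# A regeneration-style answer fixes the bug but rewrites every line.
REWRITE = "\n".join(
    ["def add(a: int, b: int) -> int:", "    total = a + b", "    return int(total)"]
)

for name, patch in (("targeted", TARGETED), ("rewrite", REWRITE)):
    print(name, edit_precision(BUGGY, patch, REFERENCE), bug_recall(BUGGY, patch, REFERENCE))
```

In this toy case both patches fix the bug, so recall is identical, but the rewrite edits three lines where one was needed and its precision drops to one third. That is the gap between repairing and regenerating described above.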
+## Released datasets
+| Dataset | Size | Bug granularity | Notes |
+|---|---|---|---|
+| [PDB-Single](https://huggingface.co/datasets/Precise-Debugging-Benchmarking/PDB-Single) | 7,589 | single line | full initial pool before easy-case filtering |
+| [PDB-Single-Hard](https://huggingface.co/datasets/Precise-Debugging-Benchmarking/PDB-Single-Hard) | 5,751 | single line | hard subset: tasks not easily solved by 7+ of 9 reference models |
+| [PDB-Multi](https://huggingface.co/datasets/Precise-Debugging-Benchmarking/PDB-Multi) | 256 | 2&ndash;4 line blocks | multi-line extension on programs with ≥35 LOC; atomicity-filtered |
+All three are derived from [BigCodeBench](https://huggingface.co/datasets/bigcode/bigcodebench)
+and [LiveCodeBench](https://huggingface.co/datasets/livecodebench/execution),
+generated by the PDB pipeline, and evaluated on precision, recall, and unit-test
+pass rate.
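
The datasets in the table can presumably be pulled with the Hugging Face `datasets` library. The sketch below assumes the repositories are public and loadable with their default configuration; the helper names `repo_id` and `load_pdb` are hypothetical, and no split or column names are assumed because the card does not list them.

```python
# Organization and dataset names are taken from the table links above.
ORG = "Precise-Debugging-Benchmarking"
DATASETS = ("PDB-Single", "PDB-Single-Hard", "PDB-Multi")


def repo_id(name):
    """Build the Hub repository ID for one of the released datasets."""
    if name not in DATASETS:
        raise ValueError(f"unknown dataset {name!r}; expected one of {DATASETS}")
    return f"{ORG}/{name}"


def load_pdb(name):
    """Download a PDB dataset from the Hub (needs `pip install datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so repo_id works without the library

    # With no split argument, load_dataset returns every available split.
    return load_dataset(repo_id(name))


# Example usage (left commented out because it downloads data):
# ds = load_pdb("PDB-Multi")
# print(ds)
```

Printing the returned object is a quick way to see which splits and columns a dataset actually exposes before writing evaluation code against it.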
+## Citation
+```
+@inproceedings{zhu2026pdb,
+  title     = {Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?},
+  author    = {Zhu, Wang Bill and Chai, Miaosen and Wang, Shangshang and Liu, Yejia and
+               Bian, Song and Dong, Honghua and Neiswanger, Willie and Jia, Robin},
+  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
+  year      = {2026},
+}
+```
+## Contact
+Questions / submissions: wangzhu@usc.edu, miaosenc@usc.edu.