CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
Paper • 2507.10535 • Published
PatchJudge evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.
The entire industry evaluates coding agents by "does the test suite pass?" — and this is broken:
PatchJudge scores every patch on 5 dimensions, then computes a single MergeScore (0-100):
| Dimension | What It Measures | Weight |
|---|---|---|
| Correctness | Does the fix address the actual issue, not just the test? | 30% |
| Completeness | Are edge cases handled? Error handling present? | 20% |
| Code Quality | Clean, idiomatic, maintainable, follows conventions? | 20% |
| Non-Regression Risk | Could this break unrelated functionality? | 15% |
| Merge-Readiness | Would a senior engineer approve this PR as-is? | 15% |
Input: {issue_text, patch_diff, test_results, repo_context}
│
┌────────────────┼────────────────┐
▼ ▼ ▼
Feature LLM Judge Score
Extractor (structured Aggregator
(AST, diff 5-dimension (weighted avg
stats) eval) → MergeScore)
patchjudge/data_loader.py)
PatchExample formatpatchjudge/feature_extractor.py)
patchjudge/judge.py)
patchjudge/validation.py)
999 patch examples from SWE-bench Verified:
Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:
| Metric | Value |
|---|---|
| Mean MergeScore | 50.6/100 |
| Median MergeScore | 49.5/100 |
| Std Dev | 13.8 |
| Score range | 23.0 – 80.5 |
| Dimension | Mean | Std |
|---|---|---|
| Correctness | 5.8 | 1.9 |
| Completeness | 4.3 | 1.3 |
| Code Quality | 5.1 | 1.8 |
| Non-Regression Risk | 5.2 | 1.8 |
| Merge-Readiness | 4.5 | 1.7 |
| Agent | Mean MergeScore | Patches |
|---|---|---|
| CoderForge (Qwen3-32B) | 49.9 | 52 |
| OpenHands+O1 | 52.5 | 20 |
In earlier testing, the judge correctly identified known-bad patterns:
pass): 18.5/100from patchjudge.judge import PatchJudge, quick_judge
# One-shot evaluation
result = quick_judge(
problem_statement="Fix divide by zero in calculate_average",
agent_patch="diff --git a/utils.py...",
gold_patch="diff --git a/utils.py...",
test_passed=True,
)
print(f"MergeScore: {result.merge_score}/100")
print(result.summary())
python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad
s3://swe-bench-submissions/verified/MIT