PatchJudge — Post-Test Code Quality Scorer for AI Coding Agents

PatchJudge evaluates whether AI-generated code patches are actually good — not just whether they pass unit tests.

The Problem

The entire industry evaluates coding agents by "does the test suite pass?" — and this is broken:

  • OpenAI abandoned SWE-bench Verified because 16.4%+ of test cases are flawed
  • METR found ~50% of test-passing PRs wouldn't be merged into real codebases
  • 7.8% of "correct" patches are actually wrong — they pass tests but are incomplete/broken (PatchDiff, 2503.15223)
  • 23% of SWE-bench tasks can be "solved" by a trivial regex patch

The Solution: MergeScore

PatchJudge scores every patch on 5 dimensions, then computes a single MergeScore (0-100):

  • Correctness (30%): Does the fix address the actual issue, not just the test?
  • Completeness (20%): Are edge cases handled? Is error handling present?
  • Code Quality (20%): Is the code clean, idiomatic, maintainable, and convention-following?
  • Non-Regression Risk (15%): Could this break unrelated functionality?
  • Merge-Readiness (15%): Would a senior engineer approve this PR as-is?
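
For illustration, the aggregation can be read as a weighted average of the five 0-10 dimension scores, rescaled to 0-100. The sketch below uses the weights from the list above; the function name and score layout are illustrative, not the exact PatchJudge API.

# Illustrative MergeScore aggregation (not the exact PatchJudge internals).
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "code_quality": 0.20,
    "non_regression_risk": 0.15,
    "merge_readiness": 0.15,
}

def merge_score(dimension_scores: dict) -> float:
    """Combine per-dimension scores (0-10 each) into a single 0-100 MergeScore."""
    weighted = sum(WEIGHTS[dim] * dimension_scores[dim] for dim in WEIGHTS)
    return round(weighted * 10, 1)  # weighted 0-10 average scaled to 0-100

# Example: a correct but somewhat rough patch.
print(merge_score({
    "correctness": 7, "completeness": 5, "code_quality": 6,
    "non_regression_risk": 6, "merge_readiness": 5,
}))  # -> 59.5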

Architecture

Input: {issue_text, patch_diff, test_results, repo_context}
                    │
   ┌────────────────┼────────────────┐
   ▼                ▼                ▼
Feature         LLM Judge        Score
Extractor       (structured      Aggregator
(AST, diff      5-dimension      (weighted avg
 stats)         eval)            → MergeScore)
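
The pipeline consumes one record per patch. A hypothetical input payload, using the field names from the diagram above with made-up values, could look like this:

# Hypothetical input record; field names follow the diagram, values are invented.
example = {
    "issue_text": "calculate_average raises ZeroDivisionError on an empty list",
    "patch_diff": "diff --git a/utils.py b/utils.py\n...",
    "test_results": {"passed": True, "failed_tests": []},
    "repo_context": {"repo": "example/utils", "base_commit": "abc123"},
}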

Components

1. Data Loader (patchjudge/data_loader.py)

  • Loads SWE-bench Verified (500 gold-standard tasks)
  • Collects agent patches from multiple sources:
    • CoderForge (Qwen3-Coder-32B): 500 instances, 297 passed
    • OpenHands+O1: 499 instances, 229 passed
    • SWE-bench S3 bucket: 139 verified agent submissions
  • Builds unified PatchExample format
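
As a sketch, the unified record might look like the dataclass below; the field names are assumptions based on the inputs listed above, not the exact PatchExample definition in data_loader.py.

from dataclasses import dataclass, field

# Hypothetical sketch of the unified record; field names are assumptions.
@dataclass
class PatchExample:
    instance_id: str        # SWE-bench task id, e.g. "django__django-12345"
    problem_statement: str  # issue text from the benchmark
    agent_patch: str        # diff produced by the coding agent
    gold_patch: str         # reference patch from SWE-bench Verified
    test_passed: bool       # whether the task's test suite passed with the agent patch
    source: str             # which agent produced it, e.g. "coderforge" or "o1"
    features: dict = field(default_factory=dict)  # populated later by the feature extractor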

2. Feature Extractor (patchjudge/feature_extractor.py)

  • AST-based analysis for Python patches
  • Diff statistics (files, lines, hunks, scope)
  • Issue-patch alignment via keyword matching
  • Code quality signals (TODOs, hardcoded values, debug statements)
  • Risk assessment (core file modifications, scope analysis)
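
To make the quality and risk signals concrete, here is a minimal, purely illustrative version of the diff-level checks; the real extractor is AST-based and more thorough, and the patterns below are assumptions.

import re

# Minimal illustration of diff-level quality signals (simpler than the real AST-based extractor).
def extract_signals(patch_diff: str) -> dict:
    added = [line[1:] for line in patch_diff.splitlines()
             if line.startswith("+") and not line.startswith("+++")]
    return {
        "files_changed": patch_diff.count("diff --git"),
        "lines_added": len(added),
        "has_todo": any("TODO" in line or "FIXME" in line for line in added),
        "has_debug_print": any(re.search(r"\bprint\(", line) for line in added),
        "has_bare_except": any(re.search(r"except\s*:", line) for line in added),
    }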

3. LLM Judge (patchjudge/judge.py)

  • Uses Qwen2.5-Coder-32B-Instruct via HF Inference API
  • Structured JSON output with reasoning per dimension
  • Temperature 0.1 for scoring consistency
  • Robust JSON parsing with retry logic
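
The general calling pattern (low temperature, structured JSON output, retry on malformed responses) can be sketched as below. The prompt content and wrapper are placeholders; the exact code in judge.py differs.

import json
from huggingface_hub import InferenceClient

# Sketch of the judging call pattern: low temperature, JSON output, retry on bad JSON.
# The prompt is a placeholder; the real prompt asks for all five dimensions with reasoning.
def judge_patch(prompt: str, retries: int = 3) -> dict:
    client = InferenceClient(model="Qwen/Qwen2.5-Coder-32B-Instruct")
    for _ in range(retries):
        response = client.chat_completion(
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,   # keeps scoring consistent across runs
            max_tokens=1024,
        )
        text = response.choices[0].message.content
        try:
            return json.loads(text)   # expected shape: {"correctness": ..., "reasoning": ...}
        except json.JSONDecodeError:
            continue                  # malformed JSON, retry
    raise ValueError("Judge did not return valid JSON after retries")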

4. Validation (patchjudge/validation.py)

  • METR alignment check (~50% of test-passing patches should score below 50)
  • Known-bad pattern detection (hardcoded returns, broad try/except, test disabling)
  • Resolved vs. unresolved separation analysis
  • Per-dimension statistical analysis
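
A small sketch of the known-bad pattern checks mentioned above; the regexes are illustrative examples rather than the exact rules in validation.py.

import re

# Illustrative known-bad pattern checks on the added lines of a diff.
KNOWN_BAD_PATTERNS = {
    "hardcoded_return": re.compile(r"^\+\s*return\s+(True|False|None|\d+)\s*$"),
    "broad_except": re.compile(r"^\+\s*except(\s+Exception)?\s*:"),
    "test_disabling": re.compile(r"^\+\s*@(unittest\.skip|pytest\.mark\.skip)"),
}

def detect_known_bad(patch_diff: str) -> list:
    """Return the names of suspicious patterns found in a patch diff."""
    return [name for name, pattern in KNOWN_BAD_PATTERNS.items()
            if any(pattern.search(line) for line in patch_diff.splitlines())]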

Dataset

999 patch examples from SWE-bench Verified:

  • 526 test-passing patches across 2 agents
  • 473 test-failing patches
  • Plus 126 synthetically generated known-bad patches, used for validation
  • Features extracted for all examples

Evaluation Results (v1)

Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:

Score Distribution

  • Mean MergeScore: 50.6/100
  • Median MergeScore: 49.5/100
  • Std dev: 13.8
  • Score range: 23.0 – 80.5

METR Alignment ✅

  • 50% of test-passing patches scored below 50 — exactly matching the METR finding that ~50% of test-passing PRs are not merge-worthy
  • Test-passing mean: 50.9, Test-failing mean: 42.5
  • Clear separation between resolved and unresolved patches

Per-Dimension Averages (0-10 scale)

  • Correctness: 5.8 (std 1.9)
  • Completeness: 4.3 (std 1.3)
  • Code Quality: 5.1 (std 1.8)
  • Non-Regression Risk: 5.2 (std 1.8)
  • Merge-Readiness: 4.5 (std 1.7)

Per-Agent Comparison

  • CoderForge (Qwen3-32B): mean MergeScore 49.9 (52 patches)
  • OpenHands+O1: mean MergeScore 52.5 (20 patches)

Known-Bad Detection

In earlier testing, the judge correctly identified known-bad patterns:

  • No-op patch (adds only a pass statement): 18.5/100
  • broad try/except patches: flagged as low quality
  • hardcoded returns: flagged as non-genuine fixes

Quick Start

from patchjudge.judge import PatchJudge, quick_judge

# One-shot evaluation
result = quick_judge(
    problem_statement="Fix divide by zero in calculate_average",
    agent_patch="diff --git a/utils.py...",
    gold_patch="diff --git a/utils.py...",
    test_passed=True,
)

print(f"MergeScore: {result.merge_score}/100")
print(result.summary())

Batch Evaluation

python run_patchjudge.py --sources coderforge,o1 --judge-count 100 --validate-known-bad

Key Research References

  • PatchDiff — "Are 'Solved Issues' in SWE-bench Really Solved Correctly?"
  • CodeJudgeBench — "Benchmarking LLM-as-a-Judge for Coding Tasks"
  • SWE-smith — "Scaling Data for Software Engineering Agents"
  • UTBoost — "Rigorous Evaluation of Coding Agents on SWE-Bench"


License

MIT
