arxiv:2604.15950

TwinTrack: Post-hoc Multi-Rater Calibration for Medical Image Segmentation

Published on Apr 17

Submitted by Tristan on Apr 20

MIC at DKFZ

Authors:

Tristan Kirscher ,

Abstract

The TwinTrack framework addresses ambiguity in pancreatic cancer segmentation by calibrating ensemble probabilities post hoc to the empirical mean human response, which improves calibration metrics on a multi-rater benchmark.

AI-generated summary

Pancreatic ductal adenocarcinoma (PDAC) segmentation on contrast-enhanced CT is inherently ambiguous: inter-rater disagreement among experts reflects genuine uncertainty rather than annotation noise. Standard deep learning approaches assume a single ground truth, producing probabilistic outputs that can be poorly calibrated and difficult to interpret under such ambiguity. We present TwinTrack, a framework that addresses this gap through post-hoc calibration of ensemble segmentation probabilities to the empirical mean human response (MHR), defined as the fraction of expert annotators labeling a voxel as tumor. Calibrated probabilities are thus directly interpretable as the expected proportion of annotators assigning the tumor label, explicitly modeling inter-rater disagreement. The proposed post-hoc calibration procedure is simple and requires only a small multi-rater calibration set. It consistently improves calibration metrics over standard approaches when evaluated on the MICCAI 2025 CURVAS-PDACVI multi-rater benchmark.
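To make the idea of calibrating to the MHR concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say which calibration map TwinTrack uses, so histogram binning stands in as an illustrative choice, and the function names (`mean_human_response`, `fit_binning_calibrator`, `apply_calibrator`) are hypothetical. The sketch assumes binary rater masks stacked along the first axis and ensemble probabilities in [0, 1].

```python
import numpy as np

def mean_human_response(rater_masks):
    """Per-voxel fraction of raters labeling the voxel as tumor.

    rater_masks: binary array of shape (n_raters, ...).
    """
    return np.asarray(rater_masks, dtype=float).mean(axis=0)

def fit_binning_calibrator(probs, mhr, n_bins=10):
    """Fit a histogram-binning map from ensemble probability to mean MHR.

    Assumption: histogram binning is a stand-in here; the paper's actual
    calibration function is not given in the abstract.
    """
    probs = np.ravel(probs)
    mhr = np.ravel(mhr)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices in 0..n_bins-1.
    idx = np.digitize(probs, edges[1:-1])
    # Empty bins fall back to the bin center (identity-like behavior).
    table = 0.5 * (edges[:-1] + edges[1:])
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            table[b] = mhr[in_bin].mean()
    return edges, table

def apply_calibrator(probs, edges, table):
    """Replace each probability with the mean MHR of its bin."""
    idx = np.digitize(np.asarray(probs), edges[1:-1])
    return table[idx]

# Toy calibration set: 3 raters, 4 voxels.
masks = np.array([[1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [1, 1, 1, 0]])
ensemble_probs = np.array([0.95, 0.95, 0.05, 0.05])

mhr = mean_human_response(masks)
edges, table = fit_binning_calibrator(ensemble_probs, mhr, n_bins=2)
calibrated = apply_calibrator(ensemble_probs, edges, table)
```

In this toy case the overconfident ensemble outputs are pulled toward the observed rater agreement, so each calibrated value can be read as an expected fraction of annotators marking the voxel as tumor. A real calibration set would contain far more voxels, and a smoother monotone map might be preferable to coarse bins.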

Community

Kirscher

Paper author Paper submitter 1 day ago

Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest cancers, and its segmentation on contrast-enhanced CT is fundamentally ambiguous: when experts disagree, that disagreement often reflects real uncertainty rather than annotation noise. TwinTrack is a simple post-hoc multi-rater calibration method that transforms ensemble segmentation probabilities into predictions aligned with the Mean Human Response, better capturing expert disagreement. In other words: not just better segmentation, but better-calibrated uncertainty for genuinely ambiguous clinical images.