This research audits the use of ensemble disagreement as a proxy for uncertainty in medical image segmentation, finding that Deep Ensembles (DE) provide superior calibration and failure detection compared to standard K-fold Cross-Validation (CV) ensembles. The study suggests sele
Ensemble disagreement is a common proxy for epistemic uncertainty in tasks such as medical image segmentation. However, standard K-fold cross-validation (CV) ensembles, which are often termed 'deep ensembles' (DE), also carry variability from training each member on a different data subset, which can distort how the resulting uncertainty is interpreted. This study compares a standard 5-fold CV ensemble against a 5-member DE (fixed training set, different random seeds) across three multi-rater segmentation datasets and modalities.
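The difference between the two constructions can be illustrated with a minimal sketch of how member training sets might be assigned. This is not the authors' code: the function names, the 20% held-out fraction for the DE, and the use of a distinct initialization seed per CV member are all assumptions made for illustration.

```python
import numpy as np

def cv_ensemble_splits(n_cases, k=5, seed=0):
    """K-fold CV ensemble: member i trains on every fold except fold i,
    so each member sees a different subset of the data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_cases), k)
    members = []
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Assumption: each fold model also gets its own init seed.
        members.append({"train_idx": np.sort(train), "init_seed": i})
    return members

def deep_ensemble_splits(n_cases, m=5, val_fraction=0.2, seed=0):
    """Deep ensemble: all members share one fixed training set and differ
    only in the random seed (initialization, data ordering)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_cases)
    # Assumption: a single held-out validation split of size val_fraction.
    n_val = int(round(val_fraction * n_cases))
    train = np.sort(perm[n_val:])
    return [{"train_idx": train, "init_seed": i} for i in range(m)]
```

In the CV variant, differences between member predictions reflect both the seed and the data each member saw; in the DE variant, only the seed varies, which is the separation the comparison above relies on.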
Results show that DEs improve calibration and failure detection relative to CV ensembles, while CV ensembles correlate more strongly with inter-rater variability. The authors conclude that ensemble construction must align with the research question: DE is preferred for reliability-oriented tasks (such as failure detection), whereas CV ensembles can serve as a proxy for ambiguity.
The work also introduces a lightweight modification to nnU-Net enabling DE training within the default pipeline.
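To make the notion of ensemble disagreement concrete, the sketch below computes two common per-pixel disagreement measures from the foreground probabilities of the ensemble members, then reduces a map to a single image-level score of the kind that could be used for failure detection. The summary does not state which disagreement measure the study uses, so the choice of variance and mutual information, the binary (foreground/background) setting, and the function names are assumptions.

```python
import numpy as np

def disagreement_maps(probs, eps=1e-12):
    """probs: array of shape (M, H, W) holding the foreground probability
    predicted by each of M ensemble members for one image."""
    probs = np.asarray(probs, dtype=float)

    def binary_entropy(p):
        # Clip to avoid log(0) at fully confident pixels.
        p = np.clip(p, eps, 1.0 - eps)
        return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

    mean_p = probs.mean(axis=0)
    # Per-pixel variance across members.
    variance = probs.var(axis=0)
    # Mutual information: entropy of the mean prediction minus the
    # mean entropy of the individual predictions.
    mutual_info = binary_entropy(mean_p) - binary_entropy(probs).mean(axis=0)
    return {"mean": mean_p, "variance": variance,
            "mutual_information": mutual_info}

def image_level_score(uncertainty_map):
    """Reduce a per-pixel map to one scalar (here simply the mean)."""
    return float(np.mean(uncertainty_map))
```

When all members output the same probabilities, both maps are zero regardless of how confident the members are; the score grows only as the members diverge, which is why the way the ensemble is constructed determines what the score actually measures.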