Ensemble Methods for Uncertainty: Why Deep Ensembles Outperform Cross-Validation in Segmentation

Ensemble disagreement is a common way to estimate epistemic uncertainty in tasks such as medical image segmentation. However, standard K-fold cross-validation (CV) ensembles, which are often loosely called 'deep ensembles' (DE), also carry variability from training each member on a different data subset, and this can distort how the resulting uncertainty is interpreted. This study compares a standard 5-fold CV ensemble against a 5-member DE (fixed training set, different random seeds) across three multi-rater segmentation datasets and modalities.
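The difference between the two constructions can be made concrete with a small sketch. It is not the study's code: the function names `cv_ensemble_plan` and `deep_ensemble_plan`, the case identifiers, and the choice to give each CV member its own seed are assumptions made for illustration. It only shows which training cases and which seed each member would receive.

```python
import random


def cv_ensemble_plan(case_ids, n_folds=5, base_seed=0):
    """Hypothetical helper: one member per fold, each trained on the other folds.

    Members differ in their training subset (and, as assumed here, in seed),
    so their disagreement mixes both sources of variability.
    """
    ids = list(case_ids)
    random.Random(base_seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    plan = []
    for k, held_out in enumerate(folds):
        held = set(held_out)
        train = sorted(c for c in ids if c not in held)
        plan.append({"member": k, "seed": base_seed + k, "train": train})
    return plan


def deep_ensemble_plan(case_ids, n_members=5, base_seed=0):
    """Hypothetical helper: every member sees the same training set.

    Only the random seed changes between members.
    """
    train = sorted(case_ids)
    return [
        {"member": m, "seed": base_seed + m, "train": train}
        for m in range(n_members)
    ]


cases = [f"case_{i:03d}" for i in range(10)]  # made-up identifiers
cv_plan = cv_ensemble_plan(cases)
de_plan = deep_ensemble_plan(cases)

for member in cv_plan:
    print("CV", member["member"], len(member["train"]), "training cases")
for member in de_plan:
    print("DE", member["member"], len(member["train"]), "training cases")
```

In this toy setup each CV member trains on 8 of the 10 cases, with a different subset per member, while every DE member trains on the same fixed set and differs only in its seed.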

Results show that DEs improve calibration and failure detection, while CV ensembles correlate more strongly with inter-rater variability. The authors conclude that ensemble construction should match the research question: DEs are preferable for reliability-oriented tasks such as failure detection, whereas CV ensembles can serve as a proxy for ambiguity.
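Both ensemble types are compared through how much their members disagree. As a rough sketch of one common way to quantify this, the snippet below computes a per-pixel disagreement map as the mutual information between the prediction and the ensemble member (entropy of the mean prediction minus the mean entropy of the members). Whether the study uses this exact measure is not stated here, so the metric, the function names, and the toy probabilities are all assumptions.

```python
import numpy as np


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli distribution with foreground probability p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def ensemble_uncertainty(member_probs):
    """Hypothetical helper for binary segmentation.

    member_probs: array of shape (n_members, *spatial) holding each member's
    foreground probability per pixel or voxel.
    Returns the mean prediction, its entropy (total uncertainty), and the
    mutual information (disagreement between members).
    """
    probs = np.asarray(member_probs, dtype=float)
    mean_prob = probs.mean(axis=0)
    total = binary_entropy(mean_prob)
    expected = binary_entropy(probs).mean(axis=0)
    # Clip tiny negative values caused by floating-point rounding.
    disagreement = np.clip(total - expected, 0.0, None)
    return mean_prob, total, disagreement


# Toy example with 5 members and 2 pixels (made-up probabilities):
# pixel 0: all members agree; pixel 1: members contradict each other.
toy_probs = np.array([
    [0.9, 0.1],
    [0.9, 0.9],
    [0.9, 0.1],
    [0.9, 0.9],
    [0.9, 0.5],
])
mean_prob, total, disagreement = ensemble_uncertainty(toy_probs)
print(disagreement)
```

The first pixel has near-zero disagreement even though its prediction is not fully certain, while the second pixel scores high because the members contradict one another. Under the framing above, the same map would be computed for a CV ensemble and a DE, and then compared against segmentation errors or inter-rater variability.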

The work also introduces a lightweight modification to nnU-Net enabling DE training within the default pipeline.
