An important issue in medical image processing is estimating not only the performance of algorithms but also the precision with which that performance is estimated. Reporting precision typically amounts to reporting the standard error of the mean (SEM) or, equivalently, confidence intervals. However, this is rarely done in medical image segmentation studies. In this paper, we aim to estimate the typical confidence that can be expected in such studies. To that end, we first perform experiments on Dice metric estimation using a standard deep learning model (U-net) and a classical task from the Medical Segmentation Decathlon. We extensively study precision estimation using both a Gaussian assumption and bootstrapping (which does not require any assumption on the distribution). We then perform simulations for other test set sizes and performance spreads. Overall, our work shows that small test sets lead to wide confidence intervals (e.g. $\sim$8 points of Dice for 20 samples with $\sigma \simeq 10$).
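The two precision estimates mentioned above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the function names, the normal quantile 1.96, the percentile variant of the bootstrap, the number of resamples, and the simulated Dice scores (mean 80, $\sigma = 10$, 20 samples) are all assumptions made for illustration.

```python
import numpy as np


def gaussian_ci(scores, z=1.96):
    """95% confidence interval for the mean under a Gaussian assumption.

    Uses mean +/- z * SEM, where SEM = sample std / sqrt(n).
    z = 1.96 is the normal quantile (an assumption; a Student t quantile
    would give a slightly wider interval for small n).
    """
    scores = np.asarray(scores, dtype=float)
    sem = scores.std(ddof=1) / np.sqrt(scores.size)
    mean = scores.mean()
    return mean - z * sem, mean + z * sem


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    Resamples the test set with replacement, recomputes the mean each time,
    and takes the alpha/2 and 1 - alpha/2 quantiles of those means.
    """
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row holds the indices of one resampled test set.
    idx = rng.integers(0, scores.size, size=(n_resamples, scores.size))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Hypothetical test set: 20 Dice scores in points (0-100), simulated here.
rng = np.random.default_rng(42)
dice = np.clip(rng.normal(80.0, 10.0, size=20), 0.0, 100.0)

g_lo, g_hi = gaussian_ci(dice)
b_lo, b_hi = bootstrap_ci(dice)
print(f"mean Dice      : {dice.mean():.1f}")
print(f"Gaussian 95% CI: [{g_lo:.1f}, {g_hi:.1f}] (width {g_hi - g_lo:.1f})")
print(f"bootstrap 95% CI: [{b_lo:.1f}, {b_hi:.1f}] (width {b_hi - b_lo:.1f})")
```

Under the Gaussian assumption the interval width is $2 \times 1.96 \times \sigma / \sqrt{n}$, so with $\sigma = 10$ and $n = 20$ it is roughly 8.8 points of Dice, the same order as the figure quoted above.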