USE-Evaluator: Performance Metrics for Medical Image Segmentation Models with Uncertain, Small or Empty Reference Annotations

Performance metrics for medical image segmentation models measure the agreement between the reference annotation and the predicted segmentation. Overlap metrics such as the Dice coefficient are usually used to evaluate these models so that results are comparable. However, the distribution of cases and the difficulty of segmentation tasks in public data sets do not match those of clinical practice. Common metrics fail to measure the impact of this mismatch, especially for clinical data sets that include low-signal pathologies, difficult segmentation tasks, and uncertain, small, or empty reference annotations. This limitation may lead machine learning practitioners to design and optimize models ineffectively. Evaluating clinical value involves accounting for the uncertainty of reference annotations, independence from reference annotation volume, and the classification of empty reference annotations. We study how uncertain, small, and empty reference annotations influence the value of metrics for medical image segmentation on an in-house data set, regardless of the model. We examine metric behavior on the predictions of a standard deep learning framework in order to identify metrics with clinical value. We compare against a public benchmark data set (BraTS 2019) with a high-signal pathology and certain, larger, and no empty reference annotations. Our results may show machine learning practitioners how uncertain, small, or empty reference annotations require a rethinking of evaluation and optimization procedures. The evaluation code has been released to encourage further analysis of this topic: https://github.com/SophieOstmeier/UncertainSmallEmpty.git
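To make the concern about small and empty reference annotations concrete, the following is a minimal sketch of how the Dice coefficient behaves on toy masks. It is not the released evaluation code: the function name `dice_score`, the flat 10,000-voxel masks, and the choice to return NaN for the undefined 0/0 case are all assumptions made for illustration.

```python
import numpy as np


def dice_score(reference, prediction):
    """Dice = 2|R & P| / (|R| + |P|) on binary masks.

    Returns NaN when both masks are empty, because the ratio is 0/0.
    (Returning NaN is an illustrative choice, not the paper's convention.)
    """
    reference = np.asarray(reference, dtype=bool)
    prediction = np.asarray(prediction, dtype=bool)
    denominator = int(reference.sum()) + int(prediction.sum())
    if denominator == 0:
        return float("nan")
    overlap = int(np.logical_and(reference, prediction).sum())
    return 2.0 * overlap / denominator


n_voxels = 10_000  # hypothetical flattened image size

# Large reference (1000 voxels); the prediction misses 10 voxels.
large_ref = np.zeros(n_voxels, dtype=bool)
large_ref[:1000] = True
large_pred = large_ref.copy()
large_pred[:10] = False
dice_large = dice_score(large_ref, large_pred)  # 1980 / 1990, about 0.995

# Small reference (20 voxels); the prediction misses the same 10 voxels.
small_ref = np.zeros(n_voxels, dtype=bool)
small_ref[:20] = True
small_pred = small_ref.copy()
small_pred[:10] = False
dice_small = dice_score(small_ref, small_pred)  # 20 / 30, about 0.667

# Empty reference: a single false-positive voxel drives Dice to 0,
# while a correct empty prediction leaves Dice undefined.
empty_ref = np.zeros(n_voxels, dtype=bool)
one_fp_pred = empty_ref.copy()
one_fp_pred[0] = True
dice_one_fp = dice_score(empty_ref, one_fp_pred)  # 0.0
dice_both_empty = dice_score(empty_ref, empty_ref)  # nan

print(dice_large, dice_small, dice_one_fp, dice_both_empty)
```

In this toy setting the same absolute error of 10 voxels costs the large lesion about 0.005 Dice and the small lesion about 0.33, and an empty reference yields either 0 or an undefined value. This is the kind of dependence on reference volume that the abstract describes.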
