Performance metrics for medical image segmentation models measure the agreement between a reference annotation and a prediction. A common set of metrics is used during the development of such models to make results more comparable. However, there is a mismatch between the distributions of public data sets and cases encountered in clinical practice. Many common metrics fail to measure the impact of this mismatch, especially for clinical data sets containing uncertain, small, or empty reference annotations. Thus, such metrics may not validate models for clinically meaningful agreement. Dimensions of evaluating clinical value include independence from reference-annotation volume, consideration of the uncertainty of reference annotations, reward of volumetric and/or location agreement, and reward of correct classification of empty reference annotations. Unlike common public data sets, our in-house data set is more representative: it contains uncertain, small, and empty reference annotations. We examine publicly available metrics on the predictions of a deep learning framework in order to identify the settings in which common metrics provide clinically meaningful results. We compare against a public benchmark data set without uncertain, small, or empty reference annotations. https://github.com/SophieOstmeier/UncertainSmallEmpty
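As a minimal illustration of the failure mode described above (not the paper's own evaluation code), the sketch below shows how the widely used Dice similarity coefficient behaves on empty or near-empty reference annotations: any false-positive prediction against an empty reference scores 0 regardless of its volume, and the score for two empty masks is undefined and depends on an arbitrary convention.

```python
import numpy as np

def dice(reference: np.ndarray, prediction: np.ndarray,
         empty_value: float = 1.0) -> float:
    """Dice similarity coefficient between two binary masks.

    `empty_value` is the score returned when both masks are empty;
    conventions in the literature vary (0, 1, or NaN), which is part
    of why Dice rewards correct empty classifications inconsistently.
    """
    intersection = np.logical_and(reference, prediction).sum()
    total = reference.sum() + prediction.sum()
    if total == 0:
        return empty_value
    return 2.0 * intersection / total

# Empty reference, tiny false-positive prediction: Dice is 0,
# no matter how small the spurious volume is.
ref = np.zeros((4, 4), dtype=bool)
pred = np.zeros((4, 4), dtype=bool)
pred[0, 0] = True
print(dice(ref, pred))                          # 0.0

# Both masks empty: the score is whatever convention we pick,
# so a correct "no finding" may or may not be rewarded.
print(dice(ref, np.zeros((4, 4), dtype=bool)))  # 1.0 under this convention
```

This is why the abstract lists "reward of correct classification of empty reference annotations" as a separate evaluation dimension: overlap-based metrics alone cannot distinguish a harmless near-miss from a clinically meaningful false positive when the reference is empty.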