Validation metrics are a key prerequisite for the reliable tracking of scientific progress and for deciding on the potential clinical translation of methods. While recent initiatives aim to develop comprehensive theoretical frameworks for understanding metric-related pitfalls in image analysis problems, there is a lack of experimental evidence on the concrete effects of common and rare pitfalls on specific applications. We address this gap in the context of colon cancer screening. Our contribution is twofold. First, we present the winning solution of the Endoscopy Computer Vision Challenge (EndoCV) on colon cancer detection, conducted in conjunction with the IEEE International Symposium on Biomedical Imaging (ISBI) 2022. Second, we demonstrate the sensitivity of commonly used metrics to a range of hyperparameters, as well as the consequences of poor metric choices. Based on comprehensive validation studies performed with patient data from six clinical centers, we found all commonly applied object detection metrics to be subject to high inter-center variability. Furthermore, our results clearly demonstrate that adopting the standard hyperparameters used in the computer vision community does not generally lead to the most clinically plausible results. Finally, we present localization criteria that correspond well to clinical relevance. Our work could be a first step towards reconsidering common validation strategies in automatic colon cancer screening applications.