The generalization performance of deep learning models for medical image analysis often decreases on images acquired with different devices, device settings, or patient populations. A better understanding of how well models generalize to new images is crucial for clinicians' trust in deep learning. Although significant research effort has recently been directed toward establishing generalization bounds and complexity measures, there is often a significant discrepancy between predicted and actual generalization performance. In addition, related large-scale empirical studies have been validated primarily on general-purpose image datasets. This paper presents an empirical study that investigates the correlation between 25 complexity measures and the generalization abilities of supervised deep learning classifiers for breast ultrasound images. The results indicate that PAC-Bayes flatness-based and path norm-based measures produce the most consistent explanation for the combination of models and data studied. We also investigate the use of a multi-task classification and segmentation approach for breast images, and report that such a learning approach acts as an implicit regularizer and is conducive to improved generalization.