The generalization performance of deep learning models for medical image analysis often degrades on images collected with different acquisition devices, device settings, or patient populations. A better understanding of how well models generalize to new images is crucial for clinicians' trust in deep learning. Although significant research effort has recently been directed toward establishing generalization bounds and complexity measures, there is still often a significant discrepancy between the predicted and actual generalization performance. Moreover, related large-scale empirical studies have been validated primarily on general-purpose image datasets. This paper presents an empirical study that investigates the correlation between 25 complexity measures and the generalization abilities of supervised deep learning classifiers for breast ultrasound images. The results indicate that PAC-Bayes flatness-based and path norm-based measures produce the most consistent explanation for the combination of models and data. We also investigate a multi-task classification and segmentation approach for breast images, and report that this learning approach acts as an implicit regularizer and is conducive to improved generalization.