This work reveals previously unrecognized challenges in the performance and generalizability of deep learning models. We (1) identify spurious shortcuts and evaluation issues that can inflate apparent performance and (2) propose training and analysis methods to address them. We trained an AI model to classify cancer on a retrospective dataset of 120,112 US exams (3,467 cancers) acquired from 2008 to 2017 and 16,693 UK exams (5,655 cancers) acquired from 2011 to 2015. We evaluated on a screening mammography test set of 11,593 US exams (102 cancers; 7,594 women; age 57.1 ± 11.0) and 1,880 UK exams (590 cancers; 1,745 women; age 63.3 ± 7.2). A model trained on images containing only view markers (no breast tissue) achieved a 0.691 AUC. The original model trained on both datasets achieved a 0.945 AUC on the combined US+UK dataset but paradoxically only 0.838 and 0.892 on the US and UK datasets, respectively. Sampling cancers equally from both datasets during training mitigated this shortcut. A similar AUC paradox occurred between diagnostic and screening exams: 0.903 combined vs 0.862 and 0.861, respectively. Removing diagnostic exams during training alleviated this bias. Finally, the model did not exhibit the AUC paradox across scanner models but still exhibited a bias toward Hologic Selenia Dimensions (SD) over Hologic Selenia (HS) exams. Analysis showed that this AUC paradox occurred when a dataset attribute had values with a higher cancer prevalence (dataset bias) and the model consequently assigned a higher probability to these attribute values (model bias). Stratification and balancing cancer prevalence can mitigate shortcuts during evaluation. Dataset and model bias can introduce shortcuts and the AUC paradox, potentially pervasive issues within the healthcare AI space. Our methods can verify and mitigate shortcuts while providing a clear understanding of performance.
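The AUC paradox described above can be illustrated with a small numerical sketch (synthetic scores, not the paper's data): when one subgroup of a dataset attribute has both a higher cancer prevalence (dataset bias) and systematically higher model scores (model bias), the pooled AUC can exceed the AUC of every subgroup.

```python
from sklearn.metrics import roc_auc_score

# Synthetic illustration (not the paper's data): subgroup A has low cancer
# prevalence and low scores; subgroup B has high cancer prevalence and a
# systematic score offset, i.e. the model is biased toward B.
y_a = [0, 0, 0, 0, 0, 1]
s_a = [0.1, 0.2, 0.3, 0.4, 0.5, 0.35]

y_b = [0, 0, 1, 1, 1]
s_b = [1.2, 1.4, 1.3, 1.35, 1.5]

auc_a = roc_auc_score(y_a, s_a)
auc_b = roc_auc_score(y_b, s_b)
auc_pooled = roc_auc_score(y_a + y_b, s_a + s_b)

# The pooled AUC exceeds both subgroup AUCs: the score offset between
# subgroups, not within-subgroup cancer discrimination, inflates it.
assert auc_pooled > max(auc_a, auc_b)
print(auc_a, auc_b, auc_pooled)
```

Stratifying evaluation by the attribute (reporting `auc_a` and `auc_b` separately) or balancing cancer prevalence across attribute values, as proposed above, exposes this inflation rather than hiding it in the pooled number.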