It has been rightfully emphasized that the use of AI for clinical decision making could amplify health disparities. An algorithm may encode protected characteristics, and then use this information for making predictions due to undesirable correlations in the (historical) training data. It remains unclear how we can establish whether such information is actually used. Besides the scarcity of data from underserved populations, very little is known about how dataset biases manifest in predictive models and how this may result in disparate performance. This article aims to shed some light on these issues by exploring new methodology for subgroup analysis in image-based disease detection models. We utilize two publicly available chest X-ray datasets, CheXpert and MIMIC-CXR, to study performance disparities across race and biological sex in deep learning models. We explore test set resampling, transfer learning, multitask learning, and model inspection to assess the relationship between the encoding of protected characteristics and disease detection performance across subgroups. We confirm subgroup disparities in terms of shifted true and false positive rates which are partially removed after correcting for population and prevalence shifts in the test sets. We further find a previously used transfer learning method to be insufficient for establishing whether specific patient information is used for making predictions. The proposed combination of test-set resampling, multitask learning, and model inspection reveals valuable new insights about the way protected characteristics are encoded in the feature representations of deep neural networks.
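To make the test-set resampling idea concrete, the following is a minimal sketch, not the authors' actual procedure. It assumes binary disease labels, binary predictions at a fixed threshold, and a single protected attribute. The helper names `tpr_fpr` and `balanced_resample`, the synthetic data, and the target values (500 samples per subgroup, 20% prevalence) are all hypothetical choices made for illustration. The sketch resamples the test set with replacement so that every subgroup has the same size and the same disease prevalence, then reports true and false positive rates per subgroup.

```python
import numpy as np


def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary arrays."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    pos = y_true.sum()
    neg = (~y_true).sum()
    tpr = (y_pred & y_true).sum() / pos if pos else float("nan")
    fpr = (y_pred & ~y_true).sum() / neg if neg else float("nan")
    return float(tpr), float(fpr)


def balanced_resample(groups, y_true, n_per_group, prevalence, rng):
    """Sample test indices with replacement so each subgroup has
    n_per_group samples and the same fraction of positive cases."""
    groups = np.asarray(groups)
    y_true = np.asarray(y_true, dtype=bool)
    n_pos = int(round(n_per_group * prevalence))
    n_neg = n_per_group - n_pos
    picked = []
    for g in np.unique(groups):
        pos_idx = np.flatnonzero((groups == g) & y_true)
        neg_idx = np.flatnonzero((groups == g) & ~y_true)
        picked.append(rng.choice(pos_idx, size=n_pos, replace=True))
        picked.append(rng.choice(neg_idx, size=n_neg, replace=True))
    return np.concatenate(picked)


# Synthetic test set (assumption): subgroup A is larger and has a higher
# disease prevalence than subgroup B.
rng = np.random.default_rng(0)
groups = np.array(["A"] * 600 + ["B"] * 200)
y_true = np.concatenate([rng.random(600) < 0.3, rng.random(200) < 0.1])
scores = rng.random(800) * 0.5 + y_true * 0.4  # toy classifier scores
y_pred = scores > 0.45                         # fixed decision threshold

# Resample to equal subgroup size and equal prevalence, then compare rates.
idx = balanced_resample(groups, y_true, n_per_group=500, prevalence=0.2, rng=rng)
for g in np.unique(groups):
    mask = groups[idx] == g
    print(g, tpr_fpr(y_true[idx][mask], y_pred[idx][mask]))
```

Because sampling is stratified by both subgroup and label, differences in the resulting rates cannot be attributed to unequal subgroup sizes or unequal prevalence in the test set. In practice the resampling would be repeated many times to obtain confidence intervals.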