While it is well known that population differences arising from genetics, sex, race, and environmental factors contribute to disease, AI studies in medicine have largely focused on locoregional patient cohorts with less diverse data sources. This limitation stems from barriers to large-scale data sharing and ethical concerns over data privacy. Federated learning (FL) is one potential pathway for AI development that enables learning across hospitals without sharing data. In this study, we show the results of various FL strategies on one of the largest and most diverse COVID-19 chest CT datasets: 21 participating hospitals across five continents, comprising >10,000 patients with >1 million images. We also propose an FL strategy that leverages synthetically generated data to overcome class and size imbalances. Finally, we describe the sources of data heterogeneity in the context of FL and show how, even among correctly labeled populations, disparities can arise due to these biases.