The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research has become a barrier to translating precision medicine research into practice. Because of heterogeneity across populations, risk prediction models often underperform in these underrepresented populations and may therefore further exacerbate known health disparities. In this paper, we propose a two-way data integration strategy that integrates heterogeneous data from diverse populations and from multiple healthcare institutions via a federated transfer learning approach. The proposed method can handle the challenging setting in which sample sizes from different populations are highly unbalanced. With only a small number of communications across participating sites, the proposed method can achieve performance comparable to that of a pooled analysis, in which individual-level data are directly pooled together. We show that the proposed method improves estimation and prediction accuracy in underrepresented populations and narrows the gap in model performance across populations. Our theoretical analysis reveals how estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity across populations. We demonstrate the feasibility and validity of our methods through numerical experiments and a real application to a multi-center study, in which we construct polygenic risk prediction models for Type II diabetes in the AA population.