Synthesizing information from multiple data sources is critical to ensure knowledge generalizability. Integrative analysis of multi-source data is challenging because of heterogeneity across sources and because privacy concerns constrain data sharing. In this paper, we consider a general robust inference framework for federated meta-learning of data from multiple sites, enabling statistical inference for the prevailing model, defined as the one matching the majority of the sites. Statistical inference for the prevailing model is challenging since it requires a data-adaptive mechanism to select eligible sites and must then account for the uncertainty of that selection. We propose a novel sampling method to address the additional variation arising from the selection. Our confidence interval (CI) construction does not require sites to share individual-level data and is shown to be valid without requiring the selection of eligible sites to be error-free. The proposed robust inference for federated meta-learning (RIFL) methodology is broadly applicable and is illustrated with three inference problems: aggregation of parametric models, high-dimensional prediction models, and inference for average treatment effects. We use RIFL to perform federated learning of mortality risk for patients hospitalized with COVID-19 using real-world EHR data from 16 healthcare centers representing 275 hospitals across four countries.
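To make the notion of a prevailing model concrete, the following is a minimal toy sketch, not the RIFL procedure itself. It assumes each site shares only a scalar estimate and its standard error, and it uses a hypothetical pairwise z-test voting rule (`majority_sites`) to keep the sites that agree with more than half of all sites. The function names, the threshold `z_crit`, and the numbers are all illustrative assumptions. The pooled estimate at the end ignores the selection uncertainty that the abstract identifies as the central difficulty, so it should not be read as a valid inference method.

```python
import numpy as np


def majority_sites(estimates, std_errors, z_crit=1.96):
    """Toy majority rule (an assumption, not the paper's rule).

    A site is kept if its estimate is statistically similar to the
    estimates of more than half of all sites (itself included).
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    # Pairwise absolute differences between site estimates.
    diff = np.abs(est[:, None] - est[None, :])
    # Standard error of each pairwise difference, assuming independent sites.
    pair_se = np.sqrt(se[:, None] ** 2 + se[None, :] ** 2)
    similar = diff / pair_se <= z_crit
    votes = similar.sum(axis=1)
    return votes > len(est) / 2


def pooled_estimate(estimates, std_errors, keep):
    """Inverse-variance weighted average over the kept sites."""
    est = np.asarray(estimates, dtype=float)[keep]
    se = np.asarray(std_errors, dtype=float)[keep]
    weights = 1.0 / se**2
    return float(np.sum(weights * est) / np.sum(weights))


# Hypothetical summary statistics: four sites agree, one is an outlier.
site_estimates = [0.50, 0.52, 0.48, 0.51, 1.40]
site_std_errors = [0.05, 0.05, 0.05, 0.05, 0.05]

keep = majority_sites(site_estimates, site_std_errors)
theta = pooled_estimate(site_estimates, site_std_errors, keep)
print(keep.tolist(), theta)
```

Only site-level summaries enter the computation, which mirrors the constraint that individual-level data are not shared. A naive CI built around `theta` would treat the selected set as fixed; accounting for errors in that selection is what the sampling method described in the abstract is meant to handle.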