In multicenter biomedical research, integrating data from multiple decentralized sites yields more robust and generalizable findings, owing to the larger combined sample size and the ability to account for between-site heterogeneity. However, sharing individual-level data across sites is often difficult because of patient privacy concerns and regulatory restrictions. To overcome this challenge, many distributed algorithms have been proposed that fit a global model by communicating only aggregated information across sites. A major challenge in applying these algorithms to real-world data is that their validity often relies on the assumption that data across sites are independent and identically distributed, which is frequently violated in practice. In biomedical applications, data distributions can differ across clinical sites, and the set of covariates available at each site may vary because of different data collection protocols. We propose a distributed inference framework for data integration in the presence of both distributional heterogeneity and structural heterogeneity in the data. By modeling heterogeneous and structurally missing data with a density-tilted generalized method of moments, we develop a general distributed algorithm, based on aggregated data, that is communication-efficient and heterogeneity-aware. We establish the asymptotic properties of our estimator and demonstrate the validity of our method via simulation studies.