Missing data interpolation in integrative multi-cohort analysis with disparate covariate information

Ekaterina Smirnova, Yongqi Zhong, Rasha Alsaadawi, Xu Ning, Amii Kress, Jordan Kuiper, Mingyu Zhang, Kristen Lyall, Sheenas Martenies, Akram Alshawabkeh, Catherine Bulka, Carlos Camargo, Jaeun Choi, Elena Colicino, Anne Dunlop, Michael Elliott, Assiamira Ferrara, Tebeb Gebrestadik, Jiang Gui, Kylie Harrall, Tina Hartert, Barry Lester, Andrew Manigault, Justin Manjourides, Yu Ni, Rosalind Wright, Robert Wright, Katherine Ziegler, Bryan Lau

Integrative analysis of datasets generated by multiple cohorts is a widely used approach for increasing sample size, precision of population estimators, and generalizability of analysis results in epidemiological studies. However, an individual cohort dataset often lacks some of the variables of interest for the integrative analysis, because they were not collected as part of the original study. Such cohort-level missingness poses methodological challenges, since missing variables have traditionally been handled in one of two ways: (1) they are removed from the data for complete case analysis; or (2) they are completed by missing data interpolation techniques using data with the same covariate distribution from other studies. In most integrative-analysis studies, neither approach is optimal: the first loses the majority of study covariates, and the second makes it difficult to specify which cohorts follow the same distribution. We propose a novel approach to identify the studies with the same distributions that could be used for completing the cohort-level missing information. Our methodology relies on (1) identifying sub-groups of cohorts with similar covariate distributions using cohort identity random forest prediction models followed by clustering; and then (2) applying a recursive pairwise distribution test for high dimensional data to these sub-groups. Extensive simulation studies show that cohorts with the same distribution are correctly grouped together in almost all simulation settings. Applying our methods to two ECHO-wide Cohort Studies reveals that the cohorts grouped together reflect the similarities in study design. The methods are implemented in the R software package relate.
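The idea behind step (2), testing pairs of cohorts for equality of covariate distributions and grouping those that are not distinguishable, can be illustrated with a minimal Python sketch. This is not the relate package and not the authors' recursive test. As an assumption for illustration, it substitutes a generic permutation test based on the energy distance, skips the random forest and clustering step entirely, and links cohorts whose pairwise test does not reject at level `alpha`. All function names, the simulated cohorts, and the threshold are hypothetical.

```python
import itertools
import math
import random


def mean_distance(a, b):
    """Average Euclidean distance over all pairs (one row from a, one from b)."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def energy_distance(a, b):
    """Energy statistic 2*E|X-Y| - E|X-X'| - E|Y-Y'|; near zero for similar samples."""
    return 2 * mean_distance(a, b) - mean_distance(a, a) - mean_distance(b, b)


def permutation_p_value(a, b, n_perm=199, seed=0):
    """Share of random relabelings whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = energy_distance(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if energy_distance(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def group_cohorts(cohorts, alpha=0.05, n_perm=199):
    """Link cohorts whose pairwise test does not reject, then return linked groups."""
    names = sorted(cohorts)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    p_values = {}
    for first, second in itertools.combinations(names, 2):
        p = permutation_p_value(cohorts[first], cohorts[second], n_perm)
        p_values[(first, second)] = p
        if p > alpha:
            parent[find(second)] = find(first)
    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values()), p_values


def simulate(n, shift, rng):
    """Draw n rows of three Gaussian covariates centered at `shift` (toy data)."""
    return [tuple(rng.gauss(shift, 1.0) for _ in range(3)) for _ in range(n)]


# Hypothetical cohorts: A and B share a distribution, C is shifted.
rng = random.Random(42)
cohorts = {
    "A": simulate(30, 0.0, rng),
    "B": simulate(30, 0.0, rng),
    "C": simulate(30, 1.5, rng),
}
groups, p_values = group_cohorts(cohorts)
print(groups)
for pair, p in sorted(p_values.items()):
    print(pair, round(p, 3))
```

In this toy setting, cohorts placed in the same group would be candidates for sharing information when completing cohort-level missing variables. Linking on non-rejection is a simplification: failing to reject equality is not evidence of equality, and the sketch applies no multiple-testing correction across pairs.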
