One challenge in exploratory association studies using observational data is that the signals are potentially weak and the features have complex correlation structures. False discovery rate (FDR) controlling procedures can provide important statistical guarantees for replicability in risk factor identification in exploratory research. In the recently established National COVID Cohort Collaborative (N3C), electronic health record (EHR) data on the same set of candidate features are collected independently at multiple sites, offering an opportunity to identify signals by combining information from different sources. This paper presents a general knockoff-based variable selection algorithm that identifies mutual signals from unions of group-level conditional independence tests, with exact FDR control guarantees in finite-sample settings. The algorithm works with general regression settings and allows heterogeneity of both the predictors and the outcomes across data sources. We demonstrate its performance with extensive numerical studies and an application to the N3C data.