In many settings, such as genome-wide association studies, where dependence among variables is common, it is often of interest to infer interaction effects in the model. However, testing pairwise interactions among millions of variables in complex, high-dimensional data suffers from low statistical power and high computational cost. To address these challenges, we propose a two-stage testing procedure with false discovery rate (FDR) control, which is known to be a less conservative multiple-testing correction. Theoretically, the difficulty in controlling the FDR is due to the dependence among test statistics across the two stages and to the fact that the number of hypothesis tests conducted in the second stage depends on the screening result of the first stage. Using the Cram\'er-type moderate deviation technique, we show that our procedure controls the FDR at the desired level asymptotically in the generalized linear model (GLM), where the model is allowed to be misspecified. In addition, the asymptotic power of the FDR control procedure is rigorously established. We demonstrate via comprehensive simulation studies that our two-stage procedure is computationally more efficient than the classical Benjamini-Hochberg (BH) procedure, with comparable or improved statistical power. Finally, we apply the proposed method to a bladder cancer dataset from dbGaP, where the scientific goal is to identify genetic susceptibility loci for bladder cancer.
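For readers unfamiliar with the classical BH procedure used as the baseline above, the following is a minimal Python sketch of the standard Benjamini-Hochberg step-up rule. It illustrates only the baseline, not the proposed two-stage procedure, whose details are not given in the abstract. The function name `benjamini_hochberg` and the example p-values are assumptions made for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard BH step-up rule (illustrative sketch, not the paper's method).

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis whose p-value is among the k_max smallest.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, chosen only to exercise the function.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, alpha=0.05)
```

Applying this rule directly to all pairwise interactions requires one test per pair, which is the computational cost the two-stage approach is meant to reduce.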