The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample sizes also introduce new computational challenges in GxE assessment, especially for set-based GxE variance component (VC) tests, a widely used strategy to boost overall GxE signals and to evaluate the joint GxE effect of multiple variants from a biologically meaningful unit (e.g., a gene). In this work, we focus on continuous traits and present SEAGLE, a Scalable Exact AlGorithm for Large-scale set-based GxE tests, to permit GxE VC tests on biobank-scale data. SEAGLE employs modern matrix computations to achieve the same "exact" results as the original GxE VC tests without imposing additional assumptions or relying on approximations. SEAGLE can easily accommodate sample sizes on the order of $10^5$, runs on standard laptops, and does not require specialized computing equipment. We demonstrate SEAGLE's performance through extensive simulations. We illustrate its utility by conducting a genome-wide gene-based GxE analysis on the Taiwan Biobank data to explore the interaction between genes and physical activity status on body mass index.
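To illustrate how a matrix computation can be both exact and scalable, consider a sketch under assumptions that are not stated in the abstract: suppose the null-model covariance has the low-rank-plus-identity form $V = \sigma I_n + \nu G G^\top$, where $G$ is an $n \times L$ genotype matrix with $L \ll n$ variants and $\sigma, \nu > 0$ are variance components. These symbols, and the choice of identity, are introduced here for illustration only and are not claimed to be SEAGLE's actual formulation. Under that assumption, the Woodbury identity gives, for any vector $y$,

$$
V^{-1} y \;=\; \frac{1}{\sigma}\left[\, y - G\left(\frac{\sigma}{\nu} I_L + G^\top G\right)^{-1} G^\top y \,\right].
$$

The right-hand side only requires solving an $L \times L$ linear system, at a cost of roughly $O(nL^2)$ rather than the $O(n^3)$ of inverting $V$ directly, and it equals the direct result up to floating-point error, so no approximation is introduced.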