Genomic data arising from a genome-wide association study (GWAS) are often not only large in scale but also incomplete. One specific form of incompleteness is missing values with a non-ignorable missingness mechanism. These intrinsic complications of genomic data present significant challenges to developing an unbiased and informative procedure for phenotype-genotype association analysis based on statistical variable selection. In this paper we develop a coherent procedure for categorical phenotype-genotype association analysis in the presence of missing values with a non-ignorable missingness mechanism in GWAS data. The procedure integrates state-of-the-art methods: random forest for variable selection, weighted ridge regression with the EM algorithm for missing data imputation, and linear statistical hypothesis testing for determining the missingness mechanism. Two simulated GWAS are used to validate the performance of the proposed procedure, which is then applied to a real data set from a breast cancer GWAS.