Modeling and drawing inference on the joint associations between single nucleotide polymorphisms and a disease has sparked interest in genome-wide association studies. In the motivating Boston Lung Cancer Survival Cohort (BLCSC) data, the large number of single nucleotide polymorphisms of interest, though smaller than the sample size, challenges inference on their joint associations with the disease outcome. In similar settings, we find that neither the de-biased lasso approach (van de Geer et al. 2014), which assumes sparsity of the inverse information matrix, nor the standard maximum likelihood method yields confidence intervals with satisfactory coverage probabilities for generalized linear models. Under this "large $n$, diverging $p$" scenario, we propose an alternative de-biased lasso approach that directly inverts the Hessian matrix without imposing the matrix sparsity assumption. This further reduces bias compared to the original de-biased lasso and ensures valid confidence intervals with nominal coverage probabilities. We establish the asymptotic distribution of any linear combination of the parameter estimates, which lays the theoretical groundwork for drawing inference. Simulations show that the proposed refined de-biased estimating method performs well in removing bias and yields honest confidence interval coverage. We use the proposed method to analyze the aforementioned BLCSC data, from a large-scale hospital-based epidemiology cohort study that investigates the joint effects of genetic variants on lung cancer risk.
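As a rough sketch of the de-biasing step described above, the following uses notation that is assumed here rather than taken from the abstract: $\ell_n(\beta)$ denotes the average negative log-likelihood of the generalized linear model over $n$ observations, $\hat\beta$ the lasso estimate, $\hat H$ the Hessian of $\ell_n$ at $\hat\beta$, and $a$ a fixed $p$-vector defining the linear combination of interest. The interval in the last line further assumes a correctly specified model with canonical link, so that $\hat H$ also estimates the information matrix. It is meant only to illustrate the idea of a one-step correction with a directly inverted Hessian, not to reproduce the exact estimator or conditions of the method.

$$
\begin{aligned}
\hat H &= \nabla^2 \ell_n(\hat\beta), \\
\hat b &= \hat\beta - \hat H^{-1}\, \nabla \ell_n(\hat\beta), \\
a^\top \hat b &\;\pm\; z_{1-\alpha/2}\, \sqrt{\frac{a^\top \hat H^{-1} a}{n}} .
\end{aligned}
$$

Here $\hat H^{-1}$ is the ordinary matrix inverse, which exists only when $p < n$ and $\hat H$ is nonsingular. This is where the sketch departs from the original de-biased lasso, which instead approximates the inverse under a sparsity assumption.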