We investigate saddlepoint approximations applied to the score test statistic in genome-wide association studies with binary phenotypes. The normal approximation to the distribution of the score test statistic becomes less accurate as sample imbalance increases and as minor allele count decreases. Applying saddlepoint approximations to the distribution of the score test statistic greatly improves the accuracy, even far out in the tail of the distribution. Using exact results for an intercept model and a binary covariate model, as well as simulations for models with nuisance parameters, we emphasize the need for continuity corrections in order to achieve valid $p$-values. The performance of the saddlepoint approximations is evaluated by overall and conditional type I error rates on simulated data. We investigate the methods further using data from UK Biobank with skin and soft tissue infections as the phenotype, for both common and rare variants. The analysis confirms that continuity correction is particularly important for rare variants, and that the normal approximation gives a highly inflated type I error rate under case imbalance.
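To make the role of the continuity correction concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes a deliberately simplified setting: an intercept-only null model in which the case probability `mu` is treated as known, integer genotypes `g` with lattice span 1, and the statistic $X = \sum_i g_i Y_i$, which equals the score statistic up to centring. The tail probability $P(X \ge x)$ is approximated with the Lugannani-Rice formula, with and without the first continuity correction for lattice variables, and compared with the exact value obtained by convolution. All function and parameter names are hypothetical.

```python
import math


def norm_sf(z):
    """Upper tail 1 - Phi(z) of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cgf(t, g, mu):
    """K(t), K'(t), K''(t) for X = sum_i g_i * Y_i with Y_i ~ Bernoulli(mu)."""
    k0 = k1 = k2 = 0.0
    for gi in g:
        e = mu * math.exp(gi * t)
        d = 1.0 - mu + e
        p = e / d  # success probability under exponential tilting
        k0 += math.log(d)
        k1 += gi * p
        k2 += gi * gi * p * (1.0 - p)
    return k0, k1, k2


def saddlepoint_tail(x, g, mu, continuity=True):
    """Lugannani-Rice approximation to P(X >= x) for integer x."""
    if not 0 < x < sum(g):
        raise ValueError("x must lie strictly inside the support of X")
    lo, hi = -40.0, 40.0
    for _ in range(200):  # bisection: K'(t) is increasing in t
        mid = 0.5 * (lo + hi)
        if cgf(mid, g, mu)[1] < x:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    if abs(t) < 1e-6:
        raise ValueError("x is at the null mean, where the formula is singular")
    k0, _, k2 = cgf(t, g, mu)
    w = math.copysign(math.sqrt(2.0 * (t * x - k0)), t)
    if continuity:
        v = (1.0 - math.exp(-t)) * math.sqrt(k2)  # lattice (first) correction
    else:
        v = t * math.sqrt(k2)  # formula for continuous variables
    return norm_sf(w) + norm_pdf(w) * (1.0 / v - 1.0 / w)


def normal_tail(x, g, mu):
    """Normal approximation to P(X >= x), without continuity correction."""
    mean = mu * sum(g)
    var = mu * (1.0 - mu) * sum(gi * gi for gi in g)
    return norm_sf((x - mean) / math.sqrt(var))


def exact_tail(x, g, mu):
    """Exact P(X >= x), building the probability mass function by convolution."""
    pmf = [1.0]
    for gi in g:
        new = [0.0] * (len(pmf) + gi)
        for k, p in enumerate(pmf):
            new[k] += p * (1.0 - mu)
            new[k + gi] += p * mu
        pmf = new
    return sum(pmf[x:])


if __name__ == "__main__":
    g = [1] * 20  # 20 heterozygous carriers of a rare variant (illustrative)
    mu = 0.05     # strong case imbalance (illustrative)
    for name, f in [("exact", exact_tail), ("normal", normal_tail)]:
        print(name, f(5, g, mu))
    print("saddlepoint", saddlepoint_tail(5, g, mu, continuity=False))
    print("saddlepoint + correction", saddlepoint_tail(5, g, mu))
```

In this toy configuration the corrected saddlepoint value stays close to the exact tail probability, while the uncorrected saddlepoint and normal approximations both understate it, which is the direction that inflates the type I error rate. The sketch ignores nuisance-parameter estimation and conditioning on the number of cases, both of which matter in a real analysis.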