Inference of population structure from genetic data plays an important role in population and medical genetics studies. As sequencing technology advances and its cost falls, increasingly available whole-genome sequencing data provide much richer information about the underlying population structure. The traditional method (Patterson, Price, and Reich, 2006) for computing and selecting the top principal components that capture population structure was originally developed for array-based genotype data and may not perform well on sequencing data, for two reasons. First, in sequencing data the number of genetic variants p is much larger than the sample size n, so the sample-to-marker ratio n/p is nearly zero, which violates the assumption of the Tracy-Widom test used in their method. Second, their method may not handle the linkage disequilibrium present in sequencing data well. To resolve these two practical issues, we propose a new method called ERStruct to determine the number of top informative principal components based on sequencing data. More specifically, we propose to use the ratio of consecutive eigenvalues as a more robust test statistic, and we approximate its null distribution using modern random matrix theory. Both simulation studies and applications to two public data sets from the HapMap 3 and the 1000 Genomes Projects demonstrate the empirical performance of our ERStruct method.
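The ratio statistic mentioned above can be illustrated with a minimal sketch. It is not the ERStruct implementation: the function name `eigenvalue_ratios`, the per-marker standardization, and the use of the n x n Gram matrix are assumptions made for illustration. The sketch only computes the ratios of consecutive sample eigenvalues; it does not approximate their null distribution or perform any test.

```python
import numpy as np


def eigenvalue_ratios(genotypes):
    """Return sorted eigenvalues and consecutive ratios (hypothetical helper).

    genotypes: (n, p) array of allele counts, n samples by p markers.
    The standardization below is an assumption, not the paper's procedure.
    """
    X = np.asarray(genotypes, dtype=float)
    n = X.shape[0]

    # Center each marker and drop monomorphic markers (zero variance).
    X = X - X.mean(axis=0)
    sd = X.std(axis=0)
    keep = sd > 0
    X = X[:, keep] / sd[keep]

    # With n much smaller than p, the n x n Gram matrix is cheap to
    # decompose and shares its nonzero eigenvalues with the p x p
    # marker covariance matrix.
    gram = X @ X.T / X.shape[1]

    # eigvalsh returns ascending eigenvalues; reverse to descending.
    eigvals = np.linalg.eigvalsh(gram)[::-1]

    # Column centering removes one dimension, so the smallest eigenvalue
    # is numerically zero; keep only the leading n - 1 values.
    eigvals = eigvals[: n - 1]

    # Ratio of each eigenvalue to the one before it; a ratio far below 1
    # marks a large gap between consecutive eigenvalues.
    ratios = eigvals[1:] / eigvals[:-1]
    return eigvals, ratios
```

In a sample drawn from two well-separated populations, one would expect the first ratio to be the smallest, since a single leading eigenvalue stands apart from the bulk. Deciding whether a given ratio is small enough to be significant is what the null distribution in the abstract is for, and that step is outside this sketch.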