ERStruct: An Eigenvalue Ratio Approach to Inferring Population Structure from Sequencing Data

Inference of population structure from genetic data plays an important role in population and medical genetics studies. The traditional EIGENSTRAT method has been widely used for computing and selecting the top principal components that capture population structure information (Price et al., 2006). With the advancement and decreasing cost of sequencing technology, whole-genome sequencing data provide much richer information about the underlying population structure. However, the EIGENSTRAT method was originally developed for analyzing array-based genotype data and thus may not perform well on sequencing data, for two reasons. First, the number of genetic variants $p$ is much larger than the sample size $n$ in sequencing data, so that the sample-to-marker ratio $n/p$ is nearly zero, violating the assumption of the Tracy-Widom test used in the EIGENSTRAT method. Second, the EIGENSTRAT method might not handle linkage disequilibrium (LD) well in sequencing data. To resolve these two critical issues, we propose a new statistical method called ERStruct to estimate the number of sub-populations based on sequencing data. We propose to use the ratio of successive eigenvalues as a more robust test statistic, and we approximate the null distribution of this test statistic using modern random matrix theory. Simulation studies found that the proposed ERStruct method has improved performance compared to the traditional Tracy-Widom test on sequencing data. We further illustrate the ERStruct method using the sequencing data set from the 1000 Genomes Project. We have also implemented ERStruct as a MATLAB toolbox, which is now publicly available on GitHub: https://github.com/bglvly/ERStruct.
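
To make the idea of a successive-eigenvalue ratio concrete, the following Python sketch computes the ratios $\lambda_{k+1}/\lambda_k$ from a standardized $n \times p$ genotype matrix and applies a naive "largest gap" rule to guess the number of sub-populations. It is an illustration only, not the ERStruct procedure: the abstract states that ERStruct tests these ratios against a null distribution approximated with random matrix theory, which is not reproduced here. The function names (`simulate_genotypes`, `eigenvalue_ratios`, `naive_num_populations`), the Balding-Nichols style simulation, and all parameter values are assumptions introduced for this example and do not come from the MATLAB toolbox.

```python
import numpy as np


def simulate_genotypes(n_per_pop=50, n_markers=2000, n_pops=3, fst=0.2, seed=0):
    """Simulate 0/1/2 genotypes for several diverged populations.

    Hypothetical helper for illustration (Balding-Nichols style model);
    not part of ERStruct.
    """
    rng = np.random.default_rng(seed)
    # Ancestral allele frequency for each marker.
    base = rng.uniform(0.1, 0.9, size=n_markers)
    # Population-specific frequencies scattered around the ancestral ones.
    a = base * (1.0 - fst) / fst
    b = (1.0 - base) * (1.0 - fst) / fst
    freqs = rng.beta(a, b, size=(n_pops, n_markers))
    blocks = [rng.binomial(2, freqs[k], size=(n_per_pop, n_markers))
              for k in range(n_pops)]
    return np.vstack(blocks)


def eigenvalue_ratios(genotypes):
    """Return descending eigenvalues and the ratios lambda_{k+1} / lambda_k."""
    x = np.asarray(genotypes, dtype=float)
    # Standardize each marker (column); drop markers with no variation.
    x = x - x.mean(axis=0)
    sd = x.std(axis=0)
    keep = sd > 0
    x = x[:, keep] / sd[keep]
    # When p >> n, the n x n Gram matrix is much cheaper than the p x p
    # covariance matrix and shares its nonzero eigenvalues.
    gram = x @ x.T / x.shape[1]
    eigvals = np.linalg.eigvalsh(gram)[::-1]  # sort in descending order
    # Column centering forces one eigenvalue to (numerically) zero; drop it.
    eigvals = eigvals[:-1]
    ratios = eigvals[1:] / eigvals[:-1]
    return eigvals, ratios


def naive_num_populations(ratios, k_max=20):
    """Guess K from the sharpest drop among the leading ratios.

    Naive heuristic, assumed for illustration: K populations produce about
    K - 1 large eigenvalues, so the smallest ratio lambda_K / lambda_{K-1}
    marks the gap. ERStruct instead uses a formal test on the ratios.
    """
    leading = np.asarray(ratios)[:k_max]
    return int(np.argmin(leading)) + 2


if __name__ == "__main__":
    G = simulate_genotypes()
    eigvals, ratios = eigenvalue_ratios(G)
    print("leading ratios:", np.round(ratios[:5], 3))
    print("naive estimate of K:", naive_num_populations(ratios))
```

Ratios close to 1 suggest that neighbouring eigenvalues belong to the same noise bulk, while a ratio far below 1 marks a gap between structure-bearing eigenvalues and the rest. The heuristic above only looks for the single sharpest drop and has no notion of statistical significance, which is precisely the gap that a calibrated null distribution is meant to fill.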
