ERStruct: An Eigenvalue Ratio Approach to Inferring Population Structure from Sequencing Data

Inference of population structure from genetic data plays an important role in population and medical genetics studies. The traditional EIGENSTRAT method has been widely used for computing and selecting the top principal components that capture population structure information (Price et al., 2006). With the advancement and decreasing cost of sequencing technology, whole-genome sequencing data provide much richer information about the underlying population structure. However, EIGENSTRAT was originally developed for array-based genotype data and may not perform well on sequencing data, for two reasons. First, in sequencing data the number of genetic variants $p$ is much larger than the sample size $n$, so the sample-to-marker ratio $n/p$ is nearly zero, which violates the assumption of the Tracy-Widom test used in EIGENSTRAT. Second, EIGENSTRAT might not handle the linkage disequilibrium (LD) present in sequencing data well. To resolve these two issues, we propose a new statistical method, ERStruct, to estimate the number of latent sub-populations from sequencing data. We use the ratio of successive eigenvalues as a more robust test statistic, and we approximate the null distribution of this statistic using modern random matrix theory. In simulation studies, ERStruct outperforms the traditional Tracy-Widom test on sequencing data. We further demonstrate the performance of ERStruct on two public data sets, from the HapMap 3 and 1000 Genomes projects. ERStruct is implemented as a MATLAB toolbox, publicly available on GitHub at https://github.com/bglvly/ERStruct.
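To make the idea of an eigenvalue-ratio statistic concrete, the following Python sketch computes the ratios of successive eigenvalues from a genotype matrix. It is an illustration under stated assumptions and not the ERStruct toolbox, which is written in MATLAB: the function names (`simulate_genotypes`, `eigenvalue_ratios`, `count_leading_eigenvalues`), the toy simulation, and the column standardization are all hypothetical choices made here. In particular, the sketch does not reproduce the null-distribution approximation from random matrix theory; it simply picks the smallest ratio among the leading ones as a naive stand-in for a formal test.

```python
import numpy as np


def simulate_genotypes(n_per_pop=30, n_pops=3, n_variants=2000, seed=0):
    """Toy data (hypothetical helper): independent allele frequencies per
    sub-population, genotypes drawn as Binomial(2, frequency)."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=(n_pops, n_variants))
    labels = np.repeat(np.arange(n_pops), n_per_pop)
    return rng.binomial(2, freqs[labels])


def eigenvalue_ratios(genotypes):
    """Return (eigenvalues in descending order, ratios lambda_{k+1} / lambda_k).

    genotypes: (n, p) array of allele counts for n individuals, p variants.
    """
    X = np.asarray(genotypes, dtype=float)
    # Drop monomorphic variants, which cannot be standardized.
    X = X[:, X.std(axis=0) > 0]
    # Standardize each variant (assumption: simple z-scoring per column).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    p = X.shape[1]
    # The n x n Gram matrix is far cheaper than a p x p matrix when p >> n.
    gram = X @ X.T / p
    eigvals = np.linalg.eigvalsh(gram)[::-1]  # eigvalsh returns ascending order
    # Column centering forces one eigenvalue to be numerically zero; drop it.
    eigvals = eigvals[:-1]
    ratios = eigvals[1:] / eigvals[:-1]
    return eigvals, ratios


def count_leading_eigenvalues(ratios, k_max=10):
    """Naive rule (not the ERStruct test): the number of leading eigenvalues
    that sit above the sharpest drop among the first k_max ratios."""
    return int(np.argmin(ratios[:k_max])) + 1


if __name__ == "__main__":
    G = simulate_genotypes()
    eigvals, ratios = eigenvalue_ratios(G)
    print("first ratios:", np.round(ratios[:5], 3))
    print("leading eigenvalues:", count_leading_eigenvalues(ratios))
```

Under the common assumption that $K$ well-separated sub-populations produce $K-1$ large eigenvalues after centering, the count returned by `count_leading_eigenvalues` plus one would serve as a rough estimate of $K$. A formal procedure would instead compare each ratio against a critical value from the approximated null distribution, as the abstract describes.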
