Principal component analysis (PCA) is commonly used in genetics to infer and visualize population structure and admixture between populations. PCA is often interpreted in a way similar to inferred admixture proportions, where it is assumed that individuals belong to one of several possible populations or are admixed between these populations. We propose a new method to assess the statistical fit of PCA (interpreted as a model spanned by the top principal components) and show that violations of the PCA assumptions affect the fit. Our method uses the chosen top principal components to predict the genotypes. By assessing the covariance (and the correlation) of the residuals (the differences between observed and predicted genotypes), we are able to detect violations of the model assumptions. Based on simulations and genome-wide human data, we show that our assessment of fit can be used to guide the interpretation of the data and to pinpoint individuals that are not well represented by the chosen principal components. Our method works equally well on other similar models, such as the admixture model, where the mean of the data is represented by a linear matrix decomposition.
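The idea of predicting genotypes from the top principal components and inspecting the residual correlation can be illustrated with a minimal sketch. It is a simplified illustration under stated assumptions, not the procedure used in this work: the function name `residual_correlation`, the simulated three-population data, and the plain truncated-SVD reconstruction are all assumptions made here, and no correction is applied for the bias that arises when an individual's own genotypes contribute to its prediction.

```python
import numpy as np

def residual_correlation(genotypes, n_components):
    """Correlation between individuals of the residuals left after
    predicting genotypes from the top principal components.

    genotypes: (n_individuals, n_sites) array of 0/1/2 allele counts.
    Hypothetical helper for illustration only.
    """
    G = np.asarray(genotypes, dtype=float)
    site_means = G.mean(axis=0)
    centered = G - site_means
    # Truncated SVD of the centered matrix gives the rank-K PCA prediction.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    k = n_components
    predicted = (U[:, :k] * s[:k]) @ Vt[:k] + site_means
    residuals = G - predicted
    # Rows are individuals, so this is an individual-by-individual matrix.
    return np.corrcoef(residuals)

def mean_abs_offdiag(corr):
    """Average absolute off-diagonal entry of a square matrix."""
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return np.abs(corr[mask]).mean()

# Simulated data (assumption): three unadmixed populations with
# independent allele frequencies, 20 individuals each, 2000 sites.
rng = np.random.default_rng(0)
n_per_pop, n_sites = 20, 2000
freqs = rng.uniform(0.1, 0.9, size=(3, n_sites))
labels = np.repeat(np.arange(3), n_per_pop)
genotypes = rng.binomial(2, freqs[labels])

corr_k1 = residual_correlation(genotypes, n_components=1)  # too few PCs
corr_k2 = residual_correlation(genotypes, n_components=2)  # enough PCs

print("mean |residual correlation|, K=1:", mean_abs_offdiag(corr_k1))
print("mean |residual correlation|, K=2:", mean_abs_offdiag(corr_k2))
```

In this toy setting, three populations need two principal components, so keeping only one leaves structured residuals that are correlated between individuals, whereas keeping two leaves residuals that are close to uncorrelated. Plotting the correlation matrix as a heatmap ordered by population would make individuals or groups that are poorly represented stand out.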