Risk prediction models based on genetic data have gained increasing traction in genomics. However, most polygenic risk models were developed using data from participants of similar (mostly European) ancestry. This can bias the risk predictors, resulting in poor generalization when they are applied to minority populations and admixed individuals such as African Americans. To address this bias, which arises largely because the prediction models are confounded by the underlying population structure, we propose a novel deep-learning framework that leverages data from diverse populations and disentangles ancestry from the phenotype-relevant information in its representation. The ancestry-disentangled representation can be used to build risk predictors that perform better across minority populations. We applied the proposed method to the analysis of Alzheimer's disease genetics. Compared with standard linear and nonlinear risk prediction methods, the proposed method substantially improves risk prediction in minority populations, particularly for admixed individuals.