Statistical and Computational Methods for Integrative Analysis of Biomedical Big Data

Project title: Statistical and Computational Methods for Integrative Analysis of Biomedical Big Data

Project number: No.61501389

Project type: Young Scientists Fund project

Year of initiation/approval: 2016

Discipline: Radio electronics; telecommunications technology

Project author: Yang Can (杨灿)

Author affiliation: HKUST Shenzhen Research Institute (香港科技大学深圳研究院)

Funding amount: CNY 210,000

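Chinese abstract (translated): Genome-wide association studies have identified more than ten thousand genetic variants associated with human phenotypes, including diseases (diabetes, psychiatric disorders) and non-disease traits (height, weight, blood pressure). However, because biological data from different layers have not been used to cross-validate these findings, many links in the causal chain from genetic variant to phenotype remain unclear. Biomedical big data characterizes life processes at every layer, including the genome, epigenome, transcriptome, proteome, and metabolome. Integrating these multilayered data effectively is the key to building complete causal chains.
This project is devoted to developing statistical and computational methods for integrative analysis of multilayered data. The methods rest on two facts: (1) the pleiotropy of genetic variants (a single variant can affect multiple phenotypes); and (2) the regulatory function of non-coding genetic variants. We therefore propose a three-stage research plan: (1) integration of genome-wide data across multiple diseases; (2) integration of genome-wide data for a single disease with biological functional data; and (3) integration of genome-wide data across multiple diseases with functional data. We expect the statistical and computational methods studied in this project to offer new ideas for other areas of big data analysis.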

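Chinese keywords (translated): biomedical big data mining; genome-wide association studies; multi-omics data fusion; pleiotropy of genetic variants; biological functional data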

English abstract: Genome-wide association studies (GWAS) have identified more than ten thousand genetic risk variants associated with complex human phenotypes, including human diseases (e.g., diabetes, psychiatric disorders) and non-disease traits (e.g., height, weight, blood pressure). However, complete chains of causality that link genetic variants to phenotypes remain largely elusive due to the lack of cross-validation from different types of functional data.
The rise of Big Data in biomedicine offers us unprecedented opportunities to build such complete chains. In contrast to conventional biomedical data, these datasets characterize biological processes at different layers, including the genome, epigenome, transcriptome, proteome, and metabolome. Integrating these multilayered data is an essential step toward deepening our understanding of the biological basis of complex diseases.
In this research, we aim to develop statistical and computational methods for prioritizing disease-associated variants via integrative analysis of multilayered data. This research is motivated by the following facts: (1) accumulating evidence suggests that different complex traits and diseases share common genetic bases, which is formally known as “pleiotropy”; and (2) functionally relevant variants have been consistently shown to be enriched among GWAS findings. Preliminary results from our pilot study suggest that joint analysis of two GWAS datasets is beneficial. To continue this promising research, we propose a stage-wise strategy for further developing our methods: (1) joint analysis of multiple GWAS datasets; (2) incorporation of functional annotation data into the analysis of a single GWAS dataset; and (3) joint analysis of multiple GWAS datasets with incorporation of functional annotation data.
The novelty of this research is that statistically rigorous and computationally efficient methods are developed to integrate multilayered data. This helps make the most efficient use of the vast amounts of valuable data that have been generated to dissect complex disease genetics. In contrast to most existing methods, which simply combine multilayered data without considering the underlying biological processes, our proposed methods allow indirect information to be shared across layers. This will greatly facilitate biologically interpretable inference and drive new scientific insights. The statistical and computational techniques developed here are also broadly applicable to many other disciplines where diverse, rich, and multilayered data are available to address challenging scientific problems.

English keywords: Mining Big Data in Biomedicine; genome-wide association studies; integrative analysis of omics data; pleiotropy; biologically functional data
