Statistical Problems in Studying Genomic Variation Based on High-Throughput Sequencing Data

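Project title: Statistical Problems in Studying Genomic Variation Based on High-Throughput Sequencing Data

Project number: No.11471022

Project type: General Program

Year of approval: 2015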

Discipline: Mathematical and Physical Sciences, and Chemistry

Project author: 席瑞斌 (Xi Ruibin)

Affiliation: Peking University

Funding: 600,000 yuan

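Abstract (translated from Chinese): The human genome harbors many kinds of variation, including structural variation, and these variations have a major impact on human health. Cancer genomes usually carry more variations than normal genomes, and some of them may play a key role in tumorigenesis. In recent years, revolutionary breakthroughs in high-throughput sequencing technology have given us an efficient platform for studying genomic variation, but the resulting explosive growth of data poses a serious challenge to our capacity for statistical and computational analysis. In particular, because structural variations are complex and high-throughput sequencing reads are short and unevenly distributed, current algorithms for detecting structural variations remain quite limited in accuracy and sensitivity. In this project, we will study several problems in the investigation and analysis of structural variation based on high-throughput sequencing data, develop robust probabilistic and statistical models and efficient algorithms, and study their statistical properties. We will address these problems mainly by building semiparametric or Bayesian models. At the same time, we will take full account of the specifics of each problem so that the models better reflect reality. We will also develop corresponding statistical software packages for use by other researchers, and apply these algorithms to real data to gain new biological knowledge.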

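Keywords (translated from Chinese): statistical computing; semiparametric models; biostatistics; high-dimensional data; high-throughput sequencing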

English abstract: Genomic variations such as structural variations (SVs) are widespread in human genomes and may confer susceptibility to various diseases. Cancer genomes often have significantly more genomic variations, some of which may play an important role in tumorigenesis. The rapid development of high-throughput sequencing (HTS) technology has provided a highly efficient platform for studying genomic variations in human genomes, but it also poses great challenges for statistical analysis of HTS data. In particular, because of the complexity of SVs, the short read length, and the various biases in HTS data, current algorithms for detecting and analyzing SVs still have limited sensitivity and specificity. In this project, we will develop a set of robust statistical models and computational tools based on HTS data to comprehensively detect and characterize SVs in human genomes, especially in human cancer genomes. The models will mainly be semiparametric or Bayesian. We will take full account of biological knowledge while developing these models so that our methods better fit real data. Software packages will also be developed to make these algorithms easily accessible to other investigators. We will also apply these algorithms to thousands of genomes sequenced by The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) to gain new knowledge about cancer genomes.

English keywords: Statistical computation; Semiparametric model; Biostatistics; High-dimensional data; High-throughput sequencing
