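Project title: Alignment Processing of Ultra-Deep Sequencing Data and Quantitative Genomic Analysis Based on Pseudo-Metric Space Partitioning Trees
Project number: No.31200995
Project type: Young Scientists Fund
Year of approval: 2013
Discipline: Genetics and Bioinformatics; Cell Biology
Project author: Cai Yunpeng (蔡云鹏)
Affiliation: Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Funding: 200,000 CNY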
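Chinese abstract (translated): Ultra-deep sequencing is a class of high-throughput sequencing techniques needed to study the detailed mechanisms of genome evolution. Traditional methods for processing sequencing data have serious shortcomings in both computational speed and accuracy and cannot meet the data throughput demands of current ultra-deep sequencing. Drawing on the ideas of pseudo-metric space partitioning trees, multidimensional scaling, and dynamic closest-pair search, and with the aid of high-performance parallel computing, this project proposes and implements original methods for efficient and accurate alignment, error correction, and clustering of massive ultra-deep sequencing data, aiming to be the first internationally to achieve accurate alignment and clustering of more than ten million pyrosequencing reads. Building on this, by converting clustering results into numerical vectors and applying data mining techniques, the project proposes and implements a set of methods for quantitative genomic analysis of multi-sample sequencing data and for uncovering the biological patterns they contain, addressing a series of problems of general significance in computer science and bioinformatics. The outcomes of the project take the form of a series of computational methods, processing pipelines, and software tools for ultra-deep sequencing data, providing strong methodological and tool support for genomics and metagenomics research.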
Chinese keywords (translated): deep sequencing; sequence alignment; sequence clustering; metagenomics;
English abstract: Ultra-deep sequencing is a type of next-generation sequencing approach for investigating the genetic details of evolutionary mechanisms in the life sciences. Traditional methods for processing sequencing data are limited in computational speed and have severe defects in accuracy, making them incapable of handling the large amounts of data that ultra-deep sequencing now produces. In this project we propose an efficient and novel method for accurate alignment, error correction, and clustering of ultra-deep sequencing data, based on the ideas of pseudo-metric space partitioning trees, multidimensional scaling, and dynamic closest-pair search, with the aid of parallel computing. The aim of the project is to handle data sets of over 10 million pyrosequencing reads, which would place it at the forefront of the state of the art. We also propose a pipeline for performing quantitative genomic analyses and exploring meaningful biological discoveries based on the clustering results, by applying numerical vectorization techniques and adopting advanced data mining methods. The implementation of the project will lead to a series of computational methods, pipelines, and software for processing ultra-deep sequencing data, which will provide powerful support for genomics and metagenomics research in the
English keywords: deep sequencing; sequence alignment; sequence clustering; metagenomics;