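Project title: Research and Implementation of a Cloud-Computing-Based Error Correction Algorithm for Next-Generation Sequencing Data
Project number: No.31501070
Project type: Young Scientists Fund project
Year of approval: 2016
Discipline: Biological sciences
Project author: Zhao Liang (赵亮)
Author affiliation: Guangxi University (广西大学)
Funding: 190,000 yuan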
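Chinese abstract (translated): Next-generation sequencing data is transforming research in the life sciences, medicine and related fields, because it can reveal at a fundamental level the underlying causes of observable traits. However, owing to the sequencing platform, the sequencing method and the sequence structure of the genes themselves, sequencing data always contain some substitution errors and insertion/deletion errors. These errors pose a great challenge for downstream data analysis. Existing error correction methods either handle only small datasets or sacrifice accuracy in order to process large-scale data. This project therefore designs a distributed parallel algorithm, based on a cloud computing platform, that can process very large datasets while maintaining accuracy. The algorithm combines the MapReduce distributed paradigm with the overlap-layout-consensus data processing model and uses a statistical model to correct sequencing errors. Its main advantages are: MapReduce allows very large datasets to be processed in a distributed, parallel manner; the overlap-layout-consensus model preserves the integrity of the data; and correcting erroneous bases with a statistical model ensures the accuracy of the algorithm.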
Chinese keywords (translated): Error correction; Next-generation sequencing; Cloud computing
English abstract: Next-generation sequencing data is having an essential impact on biological and biomedical studies because of its ability to reveal the relationship between genotypes and phenotypes. However, the data inevitably contain sequencing errors because of biases introduced by the sequencing platforms and approaches. These errors (substitutions, insertions and deletions) pose a great challenge for data analysis. Existing error correction approaches solve the problem only partially: they either handle only small datasets or sacrifice performance to cope with large ones. To solve this problem, we propose an algorithm, running on a cloud computing platform, that can handle large datasets while keeping good performance. The algorithm combines MapReduce with the overlap-layout-consensus model and corrects errors using a classical statistical model. Its advantages are threefold: the MapReduce model can handle huge volumes of data; the overlap-layout-consensus model keeps the input data intact; and the statistical model guarantees good performance.
English keywords: Error correction; Next-generation sequencing; Cloud computing
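
Illustrative sketch: the abstract names three ingredients (a MapReduce-style split of the work, an overlap-layout-consensus view of the reads, and a statistical rule for deciding which bases to change) but gives no implementation details. The toy Python example below only illustrates how such pieces could fit together and is not the project's algorithm. It assumes the overlap and layout steps have already placed each read at a known offset, it handles substitution errors only, and it uses a plain majority-fraction threshold as a stand-in for the statistical model. The names `map_read`, `reduce_column`, `correct_reads` and `min_fraction` are hypothetical.

```python
from collections import Counter, defaultdict


def map_read(offset, read):
    """Map step: emit one (column, base) pair per base of an already-placed read."""
    for i, base in enumerate(read):
        yield offset + i, base


def reduce_column(bases, min_fraction=0.6):
    """Reduce step: return the majority base of a column, or None if support is weak.

    The fixed threshold is an assumed placeholder for a real statistical model.
    """
    base, count = Counter(bases).most_common(1)[0]
    return base if count / len(bases) >= min_fraction else None


def correct_reads(placed_reads, min_fraction=0.6):
    """Correct substitution errors in reads given as (offset, sequence) pairs."""
    # Shuffle step: group the emitted bases by layout column.
    columns = defaultdict(list)
    for offset, read in placed_reads:
        for col, base in map_read(offset, read):
            columns[col].append(base)

    # One consensus call per column (None means "leave the base unchanged").
    consensus = {col: reduce_column(b, min_fraction) for col, b in columns.items()}

    corrected = []
    for offset, read in placed_reads:
        fixed = [consensus[offset + i] or base for i, base in enumerate(read)]
        corrected.append("".join(fixed))
    return corrected


# Four overlapping reads; the third carries a substitution (A instead of T).
placed = [(0, "ACGTAC"), (2, "GTACGG"), (1, "CGAACG"), (0, "ACGTACGG")]
print(correct_reads(placed))  # ['ACGTAC', 'GTACGG', 'CGTACG', 'ACGTACGG']
```

In a real distributed setting the map and reduce functions would run on separate workers and the grouping by column would be done by the framework's shuffle phase; here everything runs in one process to keep the example self-contained. Every read is returned at its original length, which mirrors the abstract's point that the input data are kept intact.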