De novo genome assembly is the process of stitching short DNA sequences into longer ones without aligning them against a reference sequence. It enables high-throughput genome sequencing and thus accelerates the discovery of new genomes. In this paper, we present PPA-assembler, a toolkit for de novo genome assembly in a distributed setting. The operations in our toolkit provide strong performance guarantees and can be combined to implement various sequencing strategies. PPA-assembler adopts the popular de Bruijn graph based approach for sequencing, and each operation is implemented as a program in Google's Pregel framework for large-scale graph processing. Experiments on large real and simulated datasets demonstrate that PPA-assembler is much more efficient than state-of-the-art assemblers and provides good sequencing quality.
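To make the de Bruijn graph based approach concrete, the following is a minimal single-machine sketch in Python. It is not PPA-assembler's Pregel implementation; the function names `build_de_bruijn` and `extend_contig`, the toy reads, and the choice of k = 4 are all assumptions made for illustration. Each k-mer in a read contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and a contig is obtained by following edges while the path remains unambiguous.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow edges from `start` while each step has exactly one way forward."""
    # Count incoming edges so that merge points end the walk.
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    contig, node, seen = start, start, {start}
    # .get avoids inserting empty entries into the defaultdict.
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if in_degree[nxt] != 1 or nxt in seen:
            break  # stop at a merge point or a cycle
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical toy reads that overlap end to end.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
```

In a distributed setting, each (k-1)-mer would presumably become a vertex handled by a worker, and the path-following step would be expressed as message passing between vertices rather than a sequential loop.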