De novo genome assembly, i.e., reconstructing the sequence of an unknown genome from redundant and error-prone short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is increasing this computational demand and calls for scalable, high-performance approaches. In this work, we present a novel distributed-memory algorithm that, starting from a string graph representation of the genome and using sparse matrices, generates the contig set, i.e., sets of overlapping sequences that form a map representing a region of a chromosome. Using the matrix abstraction, we mask branches in the string graph and compute its connected components to group genomic sequences that belong to the same linear chain (i.e., contig). We then perform multiway number partitioning to minimize load imbalance in local assembly, i.e., the concatenation of the sequences of a given contig. Based on the assignment obtained from partitioning, we apply an induced subgraph function to redistribute sequences among processes, which yields a set of local sparse matrices. Finally, we traverse each local matrix using depth-first search to concatenate sequences. Our algorithm scales well, with parallel efficiency of up to 80% on 128 nodes, while yielding uniform genome coverage and promising assembly quality. By localizing the assembly process, our contig generation algorithm significantly reduces the amount of computation spent on this step. Our work is a step toward efficient de novo long-read assembly of large genomes in distributed memory.
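The steps named in the abstract (branch masking, connected components, multiway number partitioning, and depth-first traversal) can be illustrated with a small serial sketch. This is not the authors' distributed sparse-matrix implementation: it assumes an undirected string graph stored as a plain adjacency dictionary, treats any vertex with more than two neighbors as a branch, measures contig size as the number of reads, and swaps in the greedy longest-processing-time heuristic for the partitioning step. All function names are hypothetical.

```python
import heapq

def mask_branches(adj):
    """Remove every edge incident to a branch vertex (assumed: degree > 2)."""
    branch = {v for v, nbrs in adj.items() if len(nbrs) > 2}
    return {v: set() if v in branch else {u for u in nbrs if u not in branch}
            for v, nbrs in adj.items()}

def connected_components(adj):
    """Group vertices into components; each one is a candidate linear chain."""
    seen, comps = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        comps.append(sorted(comp))
    return comps

def partition_lpt(sizes, nprocs):
    """Greedy LPT heuristic: give the next-largest contig to the lightest process."""
    heap = [(0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    owner = {}
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, p = heapq.heappop(heap)
        owner[idx] = p
        heapq.heappush(heap, (load + sizes[idx], p))
    return owner

def walk_chain(adj, comp):
    """Depth-first traversal of one chain, starting from an end vertex if any."""
    ends = [v for v in comp if len(adj[v]) <= 1]
    start = min(ends) if ends else min(comp)
    order, seen, stack = [], {start}, [start]
    while stack:
        v = stack.pop()
        order.append(v)
        for u in sorted(adj[v]):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return order

# Toy string graph: chain 0-1-2 meets branch vertex 3, which forks to 4-6 and 5-7.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4, 5},
       4: {3, 6}, 5: {3, 7}, 6: {4}, 7: {5}}

masked = mask_branches(adj)
contigs = connected_components(masked)
owner = partition_lpt([len(c) for c in contigs], nprocs=2)
orders = [walk_chain(masked, c) for c in contigs]
```

In the distributed setting described above, `owner` would drive the redistribution of sequences so that each process holds the induced subgraph of its own contigs, and the traversal would then run locally with no further communication. The sketch only returns read orderings; concatenating the actual sequences along each ordering is left out.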