Nanopore sequencing is a widely used high-throughput genome sequencing technology that can sequence long fragments of a genome into raw electrical signals at low cost. Nanopore sequencing requires two computationally costly processing steps for accurate downstream genome analysis. The first step, basecalling, translates the raw electrical signals into nucleotide bases (i.e., A, C, G, T). The second step, read mapping, finds the correct location of a read in a reference genome. In existing genome analysis pipelines, basecalling and read mapping are executed separately. We observe in this work that executing these two most time-consuming steps separately inherently leads to (1) significant data movement and (2) redundant computations on the data, both of which slow down the genome analysis pipeline. This paper proposes GenPIP, an in-memory genome analysis accelerator that tightly integrates basecalling and read mapping. GenPIP improves the performance of the genome analysis pipeline with two key mechanisms: (1) in-memory, fine-grained collaborative execution of the major genome analysis steps in parallel; and (2) a new technique for early rejection of low-quality and unmapped reads, which stops the analysis of such reads as soon as possible and thereby reduces inefficient computation. Our experiments show that, for the execution of the genome analysis pipeline, GenPIP provides 41.6X (8.4X) speedup and 32.8X (20.8X) energy savings with negligible accuracy loss compared to state-of-the-art software genome analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design that combines state-of-the-art in-memory basecalling and read mapping accelerators, GenPIP provides 1.39X speedup and 1.37X energy savings.
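To make the early-rejection idea concrete, the following is a minimal software sketch and not GenPIP's actual in-memory design. It assumes a read arrives as a stream of already-basecalled chunks, each carrying an average quality score, and it rejects the read once the first few chunks fall below a quality threshold, so the remaining chunks are never processed. The names `Chunk` and `process_read`, the threshold `min_quality`, and the probe length `probe_chunks` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Chunk:
    """One basecalled piece of a read (hypothetical structure)."""
    bases: str           # nucleotide bases produced for this chunk
    mean_quality: float  # average per-base quality score of this chunk


def process_read(chunks: Iterable[Chunk],
                 min_quality: float = 7.0,   # assumed threshold, not from the paper
                 probe_chunks: int = 2       # assumed number of chunks to inspect
                 ) -> Tuple[str, str, int]:
    """Consume chunks one at a time and reject low-quality reads early.

    Returns (status, bases, chunks_processed). A rejected read returns
    no bases, and later chunks are never pulled from the input stream.
    """
    collected = []
    quality_sum = 0.0
    for count, chunk in enumerate(chunks, start=1):
        collected.append(chunk.bases)
        quality_sum += chunk.mean_quality
        # After the probe window, decide whether the read is worth finishing.
        if count == probe_chunks and quality_sum / count < min_quality:
            return ("rejected", "", count)
    return ("accepted", "".join(collected), len(collected))


good = [Chunk("ACGT", 12.0), Chunk("TTGA", 11.0), Chunk("CCAT", 10.0)]
bad = [Chunk("ACGT", 3.0), Chunk("TTGA", 4.0), Chunk("CCAT", 12.0)]

print(process_read(good))
print(process_read(bad))
```

In this sketch the low-quality read stops after two chunks, which mirrors the stated goal of avoiding computation on reads that would be discarded anyway. A fuller model would also forward accepted chunks to a read mapper as they are produced and reject reads that fail to map.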