GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis

Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, Onur Mutlu

From arXiv; published at ASPLOS 2022

Read mapping is a fundamental, yet computationally-expensive step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). To address the computational challenges in genome analysis, many prior works propose various approaches such as filters that select the reads that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the computation overhead, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different read lengths and error rates, and 2) different degrees of genetic variation. Through rigorous analysis of read mapping processes, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based SSD. Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05$\times$ (1.52-3.32$\times$) for read sets with high similarity to the reference genome and 1.45-33.63$\times$ (2.70-19.2$\times$) for read sets with low similarity to the reference genome.
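The filtering idea in the abstract, where only reads that still need expensive computation are passed on, can be illustrated with a toy sketch. This is not GenStore's actual design, which the abstract does not detail. It assumes fixed-length reads and a simple exact-match filter, runs on the host rather than inside an SSD, and all names (`build_reference_index`, `filter_reads`) are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def build_reference_index(reference: str, read_len: int) -> Set[str]:
    """Collect every substring of length read_len that occurs in the reference."""
    return {reference[i:i + read_len]
            for i in range(len(reference) - read_len + 1)}


def filter_reads(reads: Iterable[str], index: Set[str]) -> Tuple[List[str], int]:
    """Split reads into those needing full mapping and a count of filtered ones.

    Assumption: a read that exactly matches the reference needs no further
    expensive alignment, so it is dropped before any data movement.
    """
    remaining: List[str] = []
    filtered = 0
    for read in reads:
        if read in index:
            filtered += 1           # exact match: no need to send it onward
        else:
            remaining.append(read)  # must still undergo full read mapping
    return remaining, filtered


# Toy data, purely illustrative.
reference = "ACGTACGTTAGC"
reads = ["ACGT", "CGTT", "ACGA", "TAGC"]
index = build_reference_index(reference, read_len=4)
remaining, filtered = filter_reads(reads, index)
print(f"filtered {filtered} reads; {len(remaining)} still need mapping: {remaining}")
```

In this toy setting, three of the four reads are filtered out, so only one read would have to leave storage. The more similar the read set is to the reference, the larger the fraction such a filter can remove.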
