Read mapping is a fundamental yet computationally expensive step in many genomics applications. It identifies potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). To address the computational challenges in genome analysis, many prior works propose approaches such as efficient heuristics, hardware acceleration, and filters that select the reads that must undergo expensive computation. While effective at reducing the computational overhead, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis, which greatly reduces both the data movement and the computational overheads of that analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different read lengths and error rates, and 2) different degrees of genetic variation. Based on a rigorous analysis of the read mapping process, we design low-cost hardware accelerators and data/computation flows inside a NAND flash-based SSD. Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05$\times$ (1.52-3.32$\times$) for read sets with high similarity to the reference genome and 1.45-33.63$\times$ (2.70-19.2$\times$) for read sets with low similarity to the reference genome.
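To make the idea of a filter that selects the reads needing expensive computation concrete, the following is a minimal sketch of one simple illustrative filter: reads that match the reference exactly are resolved by a lookup, and only the remainder is passed on to costly alignment. The choice of exact matching, the function names `build_index` and `filter_reads`, and the toy sequences are all assumptions made for illustration; the abstract does not specify how GenStore's in-storage filters work, and this host-side Python says nothing about their hardware design.

```python
def build_index(reference, read_len):
    """Map every read_len-long substring of the reference to its first position.

    Hypothetical helper: a real system would use a far more compact structure.
    """
    index = {}
    for pos in range(len(reference) - read_len + 1):
        # setdefault keeps the first (leftmost) occurrence of each substring.
        index.setdefault(reference[pos:pos + read_len], pos)
    return index


def filter_reads(reads, index):
    """Split reads into exact matches and reads that still need alignment."""
    matched, needs_alignment = [], []
    for read in reads:
        pos = index.get(read)
        if pos is not None:
            matched.append((read, pos))      # resolved by lookup alone
        else:
            needs_alignment.append(read)     # must undergo expensive computation
    return matched, needs_alignment


# Toy data, invented for this example.
reference = "ACGTACGTTAGC"
reads = ["ACGT", "GTTA", "ACGA"]

index = build_index(reference, read_len=4)
matched, needs_alignment = filter_reads(reads, index)
print(matched)          # reads found verbatim in the reference, with positions
print(needs_alignment)  # reads forwarded to the aligner
```

In this sketch, the larger the fraction of reads the filter resolves, the less data has to reach the aligner, which is the effect the abstract attributes to filtering inside the storage device.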