A critical step of genome sequence analysis is the mapping of sequenced DNA fragments (i.e., reads) collected from an individual to a known linear reference genome sequence (i.e., sequence-to-sequence mapping). Recent works replace the linear reference sequence with a graph-based representation of the reference genome, which captures the genetic variations and diversity across many individuals in a population. Mapping reads to the graph-based reference genome (i.e., sequence-to-graph mapping) results in notable quality improvements in genome analysis. Unfortunately, while sequence-to-sequence mapping is well studied with many available tools and accelerators, sequence-to-graph mapping is a more difficult computational problem, with far fewer practical software tools currently available. We analyze two state-of-the-art sequence-to-graph mapping tools and reveal four key issues. We find that there is a pressing need for a specialized, high-performance, scalable, and low-cost algorithm/hardware co-design that alleviates bottlenecks in both the seeding and alignment steps of sequence-to-graph mapping. To this end, we propose SeGraM, a universal algorithm/hardware co-designed genomic mapping accelerator that can effectively and efficiently support both sequence-to-graph mapping and sequence-to-sequence mapping, for both short and long reads. To our knowledge, SeGraM is the first algorithm/hardware co-design for accelerating sequence-to-graph mapping. SeGraM consists of two main components: (1) MinSeed, the first minimizer-based seeding accelerator; and (2) BitAlign, the first bitvector-based sequence-to-graph alignment accelerator. We demonstrate that SeGraM provides significant improvements for multiple steps of the sequence-to-graph and sequence-to-sequence mapping pipelines.
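To make the notion of minimizer-based seeding concrete, the following is a minimal software sketch of the general technique on a linear reference. It is not the MinSeed hardware design, and it does not handle graph-based references. The function names (`minimizers`, `build_index`, `seed_hits`), the parameters `k` (k-mer length) and `w` (number of consecutive k-mers per window), and the use of lexicographic order to rank k-mers are all assumptions made for illustration; practical tools typically rank k-mers by a hash value instead.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Each window covers w consecutive k-mers. The minimizer of a window is
    its lexicographically smallest k-mer (an assumption of this sketch);
    ties are broken by taking the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def build_index(reference, k, w):
    """Map each minimizer k-mer of the reference to its positions."""
    index = {}
    for pos, kmer in minimizers(reference, k, w):
        index.setdefault(kmer, []).append(pos)
    return index


def seed_hits(read, index, k, w):
    """Return (read_position, reference_position) pairs for shared minimizers."""
    hits = []
    for read_pos, kmer in minimizers(read, k, w):
        for ref_pos in index.get(kmer, []):
            hits.append((read_pos, ref_pos))
    return hits


if __name__ == "__main__":
    reference = "ACGTTGCATGCA"          # toy reference, hypothetical data
    read = reference[3:11]              # error-free read starting at offset 3
    index = build_index(reference, k=3, w=3)
    for read_pos, ref_pos in seed_hits(read, index, k=3, w=3):
        print(read_pos, ref_pos, ref_pos - read_pos)
```

In this toy case every hit lies on the same diagonal (reference position minus read position equals 3), which is the kind of candidate location a seeding step would pass on to the alignment step.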