In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the $r$-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the $r$-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the $r$-index, while preserving the space and time bounds. We add $O(r)$ samples of the longest common prefix (LCP) array, where $r$ is the number of equal-letter runs of the BWT. These samples permit the computation of the second longest match of each pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof-of-concept of our approach, which we call mum-phinder, and tested it on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. When the dataset is not highly repetitive, our method is up to 8 times smaller but up to 19 times slower; on highly repetitive data, it is up to 6.5 times slower and uses up to 25 times less memory.
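To make the notion of a MUM concrete, the following is a minimal brute-force sketch in Python. It assumes the usual definition: a match between the pattern and the text that cannot be extended to the left or right and whose matched string occurs exactly once in the text and exactly once in the pattern. It is not the $r$-index algorithm described above and uses no BWT or LCP samples; the function names `count_occurrences` and `naive_mums` are hypothetical and chosen only for illustration.

```python
def count_occurrences(text, word):
    """Count occurrences of word in text, allowing overlaps."""
    return sum(text.startswith(word, k)
               for k in range(len(text) - len(word) + 1))


def naive_mums(text, pattern):
    """Return (pattern_pos, text_pos, length) for every MUM, by brute force.

    Illustrative only: this takes polynomial time and is meant to mirror
    the definition, not to scale to pangenomic inputs.
    """
    mums = []
    for i in range(len(pattern)):
        for j in range(len(text)):
            if pattern[i] != text[j]:
                continue
            # Skip starts that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and pattern[i - 1] == text[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(pattern) and j + length < len(text)
                   and pattern[i + length] == text[j + length]):
                length += 1
            word = pattern[i:i + length]
            # Keep the maximal match only if it is unique on both sides.
            if (count_occurrences(text, word) == 1
                    and count_occurrences(pattern, word) == 1):
                mums.append((i, j, length))
    return mums


if __name__ == "__main__":
    # The whole pattern "TTAC" occurs once in "GATTACA", at text position 2.
    print(naive_mums("GATTACA", "TTAC"))
```

In this toy example the single-character matches such as "T" and "A" are discarded because they occur more than once in the text, which is the uniqueness condition that the second longest match is used to check in the indexed setting.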