We describe an algorithm to find maximal exact matches (MEMs) among HiFi reads with homopolymer errors. The main novelty in our work is that we resort to run-length compression to help deal with errors. Our method receives as input a run-length-encoded string collection containing the HiFi reads along with their reverse complements. Subsequently, it splits the encoding into two arrays, one storing the sequence of symbols for equal-symbol runs and another storing the run lengths. The purpose of the split is to get the BWT of the run symbols and reorder their lengths accordingly. We show that this special BWT, as it encodes the HiFi reads and their reverse complements, supports bi-directional queries for the HiFi reads. Then, we propose a variation of the MEM algorithm of Belazzougui et al. (2013) that exploits the run-length encoding and the implicit bi-directional property of our BWT to compute approximate MEMs. Concretely, if the algorithm finds that two substrings, $a_1 \ldots a_p$ and $b_1 \ldots b_p$, have a MEM, then it reports the MEM only if their corresponding length sequences, $\ell^{a}_1 \ldots \ell^{a}_p$ and $\ell^{b}_1 \ldots \ell^{b}_p$, do not differ beyond an input threshold. We use a simple metric to calculate the similarity of the length sequences that we call the {\em run-length excess}. Our technique facilitates the detection of MEMs with homopolymer errors as it does not require dynamic programming to find approximate matches where the only edits are the lengths of the equal-symbol runs. Finally, we present a method that relies on a geometric data structure to report the text occurrences of the MEMs detected by our algorithm.
翻译:我们描述一个在 HiFi 中找到最大精确匹配( MEMs) 的算法, 在 HiFi 中找到最高精确匹配值( MEMs), 是同质聚合错误。 我们工作的主要新颖之处是, 我们使用运行时间压缩来帮助处理错误。 我们的方法作为输入接收包含 HiFi 的运行长编码字符串收藏及其反向补充。 随后, 我们的方法将编码分成两个阵列, 一个为等义运行存储符号序列, 另一个存储运行长度。 拆分的目的是获取运行符号的 BWT, 并相应重新排序。 我们显示, 这个特殊的 BWT, 因为它将 HiFi 读起来和它们的反向补补补, 支持 HiFi 读的双向解码查询 。 然后, 我们提议对 Belazzougui 和 al. 2013 的 MEM 算法进行更改, 利用运行运行运行的运行时间编码和隐隐含双向的双向特性, MIMMMM, 如果算法发现两个子, 我们的计算方法是以简单的 = dal_ dal_ p$; 。