Aligning a sequence to a walk in a labeled graph is a problem of fundamental importance to computational biology. For finding a walk in an arbitrary graph with $|E|$ edges that exactly matches a pattern of length $m$, a lower bound based on the Strong Exponential Time Hypothesis (SETH) implies that an algorithm significantly faster than $O(|E|m)$ time is unlikely [Equi et al., ICALP 2019]. However, for many special graphs, such as de Bruijn graphs, the problem can be solved in linear time [Bowe et al., WABI 2012]. For approximate matching, the picture is more complex. When edits (substitutions, insertions, and deletions) are only allowed to the pattern, or when the graph is acyclic, the problem is again solvable in $O(|E|m)$ time. When edits are allowed to arbitrary cyclic graphs, the problem becomes NP-complete, even on binary alphabets [Jain et al., RECOMB 2019]. These results hold even when edits are restricted to substitutions only. The complexity of approximate pattern matching on de Bruijn graphs remained open. We investigate this problem and show that the properties that make de Bruijn graphs amenable to efficient exact pattern matching do not extend to approximate matching, even when restricted to the substitutions-only case with alphabet size four. We prove that determining the existence of a matching walk in a de Bruijn graph is NP-complete when substitutions are allowed to the graph. In addition, we demonstrate that an algorithm significantly faster than $O(|E|m)$ is unlikely for de Bruijn graphs in the case where only substitutions are allowed to the pattern. This stands in contrast to pattern-to-text matching, where exact matching is solvable in linear time, as on de Bruijn graphs, but approximate matching under substitutions is solvable in subquadratic $O(n\sqrt{m})$ time, where $n$ is the text's length [Abrahamson, SIAM J. Computing 1987].
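To make the $O(|E|m)$ bound for substitutions to the pattern concrete, the following is a minimal sketch of the standard layer-by-layer dynamic program. It is an illustration and is not taken from the paper. It assumes a simplified node-labeled graph in which every node carries one character (a de Bruijn graph would first have to be mapped to this form), and the names `min_substitutions`, `labels`, and `edges` are hypothetical.

```python
def min_substitutions(labels, edges, pattern):
    """Fewest substitutions to `pattern` so that it spells some walk.

    labels:  dict mapping node -> single-character label (assumed model)
    edges:   iterable of directed (u, v) pairs
    pattern: string of length m
    Returns None when the graph has no walk with m nodes.
    """
    if not pattern:
        return 0
    inf = float("inf")

    # Predecessor lists, so each layer scans every edge exactly once.
    preds = {v: [] for v in labels}
    for u, v in edges:
        preds[v].append(u)

    # cost[v]: fewest mismatches aligning the pattern prefix read so far
    # to a walk that ends at node v.
    cost = {v: int(labels[v] != pattern[0]) for v in labels}
    for ch in pattern[1:]:
        nxt = {}
        for v in labels:
            best_pred = min((cost[u] for u in preds[v]), default=inf)
            nxt[v] = best_pred + int(labels[v] != ch)
        cost = nxt

    best = min(cost.values(), default=inf)
    return None if best == inf else best
```

Each of the $m$ layers touches every node and every edge once, so the sketch runs in $O((|V| + |E|)m)$ time, which matches the $O(|E|m)$ bound when every node has at least one incident edge. It covers only substitutions to the pattern; it does not handle substitutions to the graph, the case shown to be NP-complete.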