Pattern matching is a fundamental process in almost every scientific domain. The problem involves finding the positions of a given pattern (usually of short length) in a reference stream of data (usually of large length). The matching can be as an exact or as an approximate (inexact) matching. Exact matching is to search for the pattern without allowing for mismatches (or insertions and deletions) of one or more characters in the pattern), while approximate matching is the opposite. For exact matching, several data structures that can be built in linear time and space are used and in practice nowadays. For approximate matching, the solutions proposed to solve this matching are non-linear and currently impractical. In this paper, we designed and implemented a structure that can be built in linear time and space and solve the approximate matching problem in ($O(m + \frac {log_\Sigma ^kn}{k!} + occ$) search costs, where $m$ is the length of the pattern, $n$ is the length of the reference, and $k$ is the number of tolerated mismatches (and insertion and deletions).
翻译:几乎每个科学领域都存在模式匹配的基本过程。 问题在于找到数据参考流中特定模式( 通常是短长度的) 的位置( 通常是长长的 ) 。 匹配可以是精确的, 也可以是近似( 不精确的) 匹配 。 精确匹配是寻找模式, 不允许一个或一个以上字符在模式中出现不匹配( 插入和删除 ), 而近似匹配是相反的 。 对于精确匹配, 使用几个可以以线性时间和空间构建的数据结构, 并在目前实际操作中使用。 对于近似匹配, 为解决这一匹配而提出的解决方案是非线性且目前不切实际的 。 在本文中, 我们设计和实施了一个可以在线性时间和空间中构建的结构, 并解决在( $( m) +\ frac { {log\ SIgma {kn}} + occ$ 搜索成本, 其中, 美元是模式的长度, $n is the long of the refirn, 而 $ 是可容忍的错配配配( 和删除) ) 。