The most fundamental problem considered in algorithms for text processing is pattern matching: given a pattern $p$ of length $m$ and a text $t$ of length $n$, does $p$ occur in $t$? Multiple versions of this basic question have been considered, and by now we know algorithms that are fast both in practice and in theory. However, the rapid increase in the amount of generated and stored data creates a need for algorithms that operate directly on compressed representations of data. In the compressed pattern matching problem we are instead given a compressed representation of the text, where $n$ now denotes the length of the compressed representation and $N$ the length of the text, together with an uncompressed pattern of length $m$. The most challenging scenario, and one that is nevertheless relevant when working with highly repetitive data such as biological information, is when the chosen compression method is capable of describing a string of length exponential in the size of its representation. An elegant formalism for such a compression method is that of straight-line programs, which are simply context-free grammars describing exactly one string. While it was known that the compressed pattern matching problem can be solved in $O(m+n\log N)$ time for this compression method, designing a linear-time algorithm remained open. We resolve this open question by presenting an $O(n+m)$ time algorithm that, given a context-free grammar of size $n$ that produces a single string $t$ and a pattern $p$ of length $m$, decides whether $p$ occurs in $t$ as a substring. To this end, we devise improved solutions for the weighted ancestor problem and the substring concatenation problem.
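To make the notion of a straight-line program concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the algorithm of this paper: the rule encoding (a list in which each rule is either a single character or a pair of indices of earlier rules) and the function names `doubling_slp`, `expansion_lengths`, `expand` and `naive_occurs` are all hypothetical. It shows how a grammar of size $n$ can derive a string whose length $N$ is exponential in $n$, and it includes a naive baseline that decompresses the text before searching, which takes time proportional to $N$ rather than $n$.

```python
# Hypothetical encoding of a straight-line program (SLP): rules[i] is either
# a single character (terminal rule) or a pair (j, k) with j, k < i, meaning
# "rule i derives the string of rule j followed by the string of rule k".
# The last rule is taken to be the start symbol.

def doubling_slp(k):
    """Return an SLP with k + 1 rules deriving 'a' repeated 2**k times."""
    rules = ["a"]
    for i in range(k):
        rules.append((i, i))  # each new rule doubles the previous one
    return rules

def expansion_lengths(rules):
    """Length of the string derived by every rule, in time linear in len(rules)."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths

def expand(rules):
    """Fully decompress the SLP. Takes time and space proportional to N."""
    texts = []
    for rule in rules:
        if isinstance(rule, str):
            texts.append(rule)
        else:
            left, right = rule
            texts.append(texts[left] + texts[right])
    return texts[-1]

def naive_occurs(rules, pattern):
    """Naive baseline (NOT the paper's method): decompress, then search."""
    return pattern in expand(rules)

# Fibonacci-like grammar: b, a, ab, aba, abaab
fib = ["b", "a", (1, 0), (2, 1), (3, 2)]
```

For instance, `expansion_lengths(doubling_slp(40))[-1]` equals `2**40` although the grammar has only 41 rules, so `expand` is impractical there even though the length computation is immediate. On the small grammar `fib`, which derives `abaab`, `naive_occurs(fib, "baa")` is true and `naive_occurs(fib, "bb")` is false.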