An extended regular expression $R$ specifies a set of strings formed by characters from an alphabet combined with concatenation, union, intersection, complement, and star operators. Given an extended regular expression $R$ and a string $Q$, the extended regular expression matching problem is to decide if $Q$ matches any of the strings specified by $R$. Extended regular expressions are a basic concept in formal language theory and a basic primitive for searching and processing data. Extended regular expression matching was introduced by Hopcroft and Ullmann in the 1970s [\textit{Introduction to Automata Theory, Languages and Computation}, 1979], who gave a simple dynamic programming solution using $O(n^3m)$ time and $O(n^2m)$ space, where $n$ is the length of $Q$ and $m$ is the length of $R$. Since then, several solutions have been proposed, but few significant asymptotic improvements have been obtained. The current state-of-the art solution, by Yamamoto and Miyazaki~[COCOON, 2003], uses $O(\frac{n^3k + n^2m}{w} + n + m)$ time and $O(\frac{n^2k + nm}{w} + n + m)$ space, where $k$ is the number of negation and complement operators in $R$ and $w$ is the number of bits in a word. This roughly replaces the $m$ factor with $k$ in the dominant terms of both the space and time bounds of the Hopcroft and Ullmann algorithm. We revisit the problem and present a new solution that significantly improves the previous time and space bounds. Our main result is a new algorithm that solves extended regular expression matching in \[O\left(n^\omega k + \frac{n^2m}{\min(w/\log w, \log n)} + m\right)\] time and $O(\frac{n^2 \log k}{w} + n + m) = O(n^2 +m)$ space, where $\omega \approx 2.3716$ is the exponent of matrix multiplication. Essentially, this replaces the dominant $n^3k$ term with $n^\omega k$ in the time bound, while simultaneously improving the $n^2k$ term in the space to $O(n^2)$.
翻译:扩展正则表达式$R$通过字母表中的字符与连接、并集、交集、补集和星号运算符的组合,定义了一组字符串。给定扩展正则表达式$R$和字符串$Q$,扩展正则表达式匹配问题旨在判定$Q$是否匹配$R$所定义的任何字符串。扩展正则表达式是形式语言理论中的基本概念,也是数据搜索与处理的基础原语。扩展正则表达式匹配由Hopcroft和Ullmann于20世纪70年代提出[\textit{Introduction to Automata Theory, Languages and Computation}, 1979],他们给出了一种简单的动态规划解法,其时间复杂度为$O(n^3m)$,空间复杂度为$O(n^2m)$,其中$n$为$Q$的长度,$m$为$R$的长度。此后,学界提出了多种解决方案,但在渐进复杂度上鲜有显著改进。当前最先进的解决方案由Yamamoto和Miyazaki提出~[COCOON, 2003],其时间复杂度为$O(\frac{n^3k + n^2m}{w} + n + m)$,空间复杂度为$O(\frac{n^2k + nm}{w} + n + m)$,其中$k$为$R$中否定和补集运算符的数量,$w$为字长位数。该方案大致将Hopcroft和Ullmann算法时空复杂度主导项中的$m$因子替换为$k$。我们重新审视该问题,并提出一种显著改进先前时空复杂度界限的新解决方案。我们的主要成果是一种新算法,其求解扩展正则表达式匹配的时间复杂度为\[O\left(n^\omega k + \frac{n^2m}{\min(w/\log w, \log n)} + m\right)\],空间复杂度为$O(\frac{n^2 \log k}{w} + n + m) = O(n^2 +m)$,其中$\omega \approx 2.3716$为矩阵乘法指数。本质上,该算法将时间复杂度主导项中的$n^3k$项替换为$n^\omega k$,同时将空间复杂度中的$n^2k$项改进为$O(n^2)$。