松散正则表达式匹配 (Sparse Regular Expression Matching)

We revisit the classic regular expression matching problem, that is, given a regular expression $R$ and a string $Q$, decide if $Q$ matches any of the strings specified by $R$. A standard textbook solution [Thompson, CACM 1968] solves this problem in $O(nm)$ time, where $n$ is the length of $Q$ and $m$ is the number of characters in $R$. More recently, several results that improve this bound by polylogarithmic factor have appeared. All of these solutions are essentially based on constructing and simulation a non-deterministic finite automaton. On the other hand, assuming the strong exponential time hypotheses we cannot solve regular expression $O((nm)^{1-\epsilon})$ [Backurs and Indyk, FOCS 2016]. Hence, a natural question is if we can design algorithms that can take advantage of other parameters of the problem to obtain more fine-grained bounds. We present the first algorithm for regular expression matching that can take advantage of sparsity of the automaton simulation. More precisely, we define the \emph{density}, $\Delta$, of the instance to be the total number of states in a simulation of a natural automaton for $R$. The density is always at most $nm+1$ but may be significantly smaller for many typical scenarios, e.g., when a string only matches a small part of the regular expression. Our main result is a new algorithm that solves the problem in $$O\left(\Delta \log \log \frac{nm}{\Delta} + n + m\right)$$ time. This result essentially replaces $nm$ with $\Delta$ in the complexity of regular expression matching. Prior to this work no non-trivial bound in terms of $\Delta$ was known. The key technical contribution is a new linear space representation of the classic position automaton that supports fast state-set transition computation in near-linear time in the size of the input and output state sets.

翻译：我们重新审视典型的常规表达式匹配问题, 也就是说, 给一个常规表达式 $ 美元和字符串 $ 美元, 确定 $Q 美元是否匹配由美元。标准教科书解决方案 [Thompson, CACM 1968] 以美元( 美元) 时间解决了这一问题, 美元长度为 $, 美元是字符数 $ R$ 。最近, 出现了几个改善这个约束的调控因素。所有这些解决方案基本上基于构建和模拟一个非确定性的自动表达式 $ 美元。另一方面, 假设一个强大的时间假设我们无法解决常规表达式 $ (nm) 美元的美元( CACM, CACM 1968) 。因此, 一个自然的问题是, 我们能否设计能够利用问题的其他参数来获取更精细的框。我们总是使用一种符合常规表达式的算法, 来利用美元总模拟的美元。。更确切地说, 当一个自然地说, 时间里, 我们定义一个状态的结果是。。