We present an algorithm for regular expression parsing and submatch extraction based on tagged deterministic finite automata. The algorithm works with different disambiguation policies. We give detailed pseudocode for the algorithm, covering important practical optimizations. All transformations from a regular expression to an optimized automaton are illustrated with a step-by-step example. We consider both ahead-of-time and just-in-time determinization and describe variants of the algorithm suited to each setting. We provide benchmarks showing that the algorithm is very fast in practice. Our research is based on two independent implementations: the open-source lexer generator RE2C and an experimental Java library.
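To make the idea of submatch extraction with a tagged automaton concrete, the following is a minimal sketch in Python. It is not the algorithm from the paper: the automaton is built by hand for the single expression `a+b+`, with one hypothetical tag marking the boundary between the two runs, and there is no ambiguity to resolve. The names `TRANSITIONS`, `match_with_tag`, and `split_groups` are assumptions introduced only for this illustration.

```python
# Toy tagged DFA for the regular expression a+ t b+, where t is a tag
# marking the boundary between the run of 'a's and the run of 'b's.
# Hypothetical, hand-built illustration; not the construction from the paper.

# Transition table: (state, symbol) -> (next_state, set_tag).
# set_tag=True means "store the current input position in the tag register".
TRANSITIONS = {
    (0, "a"): (1, False),
    (1, "a"): (1, False),
    (1, "b"): (2, True),   # first 'b': the tag position is fixed here
    (2, "b"): (2, False),
}
ACCEPTING = {2}


def match_with_tag(text):
    """Return the tag position if text matches a+b+ in full, else None."""
    state = 0
    tag = None
    for pos, symbol in enumerate(text):
        step = TRANSITIONS.get((state, symbol))
        if step is None:
            return None  # no transition: the input is rejected
        state, set_tag = step
        if set_tag:
            tag = pos  # position just before the consumed symbol
    return tag if state in ACCEPTING else None


def split_groups(text):
    """Use the tag to extract the two submatches, or None on mismatch."""
    tag = match_with_tag(text)
    if tag is None:
        return None
    return text[:tag], text[tag:]
```

In this sketch the input is scanned once, left to right, and the submatch boundary is available as soon as the scan ends. A general construction additionally has to handle expressions where several tag values are possible at once, which is where a disambiguation policy comes in.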