Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions or tokenization. This paper studies efficient implementations of double-array Aho-Corasick automata (DAACs), data structures for quickly performing the multiple pattern matching. The practical performance of DAACs is improved by carefully designing the data structure, and many implementation techniques have been proposed thus far. A problem in DAACs is that their ideas are not aggregated. Since comprehensive descriptions and experimental analyses are unavailable, engineers face difficulties in implementing an efficient DAAC. In this paper, we review implementation techniques for DAACs and provide a comprehensive description of them. We also propose several new techniques for further improvement. We conduct exhaustive experiments through real-world datasets and reveal the best combination of techniques to achieve a higher performance in DAACs. The best combination is different from those used in the most popular libraries of DAACs, which demonstrates that their performance can be further enhanced. On the basis of our experimental analysis, we developed a new Rust library for fast multiple pattern matching using DAACs, named Daachorse, as open-source software at https://github.com/daac-tools/daachorse. Experiments demonstrate that Daachorse outperforms other AC-automaton implementations, indicating its suitability as a fast alternative for multiple pattern matching in many applications.
翻译:字符串中的多重匹配模式是文本处理应用程序(如常规表达式或象征化)的根本问题。 本文研究双轨 Aho- Corasick 自动匹配( DAACs) 的有效实施, 数据结构的高效实施, 快速进行多种模式匹配的数据结构。 通过仔细设计数据结构, 迄今已提出了许多执行技术, 使DAACs的实际表现得到改善。 DAACs的一个问题是, 它们的想法没有综合起来。 由于没有全面的描述和实验分析, 工程师在执行高效的 DAAC 时面临困难。 在本文中, 我们审查DAACs 的实施技术, 并提供对这些技术的全面描述。 我们还提出了若干新的技术, 以供进一步改进。 我们通过真实世界数据集进行详尽的实验, 并展示在DAACs实现更高性业绩的最佳技术组合。 最佳组合不同于最受欢迎的DAACs图书馆所使用的技术, 这表明其业绩可以进一步加强。 根据我们的实验分析, 我们开发了一个新的 Rustrial 图书馆, 以快速多重模式匹配 Daachomas, 作为在 http://giustormas ex fastmental- ex fastard exestalestable atravelopmentals ex pas pas pas pas pas pas paputaltimalmentaldroutmental.