Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions or tokenization. This paper studies efficient implementations of \emph{double-array Aho--Corasick automata (DAACs)}, data structures for quickly performing the multiple pattern matching. The practical performance of DAACs is improved by carefully designing the data structure, and many implementation techniques have been proposed thus far. A problem in DAACs is that their ideas are not aggregated. Since comprehensive descriptions and experimental analyses are unavailable, engineers face difficulties in implementing an efficient DAAC. In this paper, we review implementation techniques for DAACs and provide a comprehensive description of them. We also propose several new techniques for further improvement. We conduct exhaustive experiments through real-world datasets and reveal the best combination of techniques to achieve a higher performance in DAACs. The best combination is different from those used in the most popular libraries of DAACs, which demonstrates that their performance can be further enhanced. On the basis of our experimental analysis, we developed a new Rust library for fast multiple pattern matching using DAACs, named \emph{Daachorse}, as open-source software at \url{https://github.com/daac-tools/daachorse}. Experiments demonstrate that Daachorse outperforms other AC-automaton implementations, indicating its suitability as a fast alternative for multiple pattern matching in many applications.
翻译:字符串中的多重匹配模式是文本处理应用程序(如常规表达式或象征化)中的一个基本问题。 本文研究的是 \ emph{ 双- rary Aho- Corasick Automata (DAACs) 的高效实施, 即快速进行多模式匹配的数据结构 。 DACs 的实际性能通过仔细设计数据结构得到改进, 并且到目前为止已经提出了许多执行技术。 DACs 的一个问题是, 他们的想法没有综合起来。 由于没有全面的描述和实验分析, 工程师在执行高效的 DAAC 时面临困难。 在本文中, 我们审查DAACs 的实施技术, 并提供对这些技术的全面描述。 我们还提出了几项新的技术, 以供进一步改进。 我们通过真实世界的数据集进行彻底的实验, 并展示各种技术的最佳组合, 以便在DAACs最受欢迎的图书馆中实现更高的性能。 最佳组合表明, 他们的性能可以进一步加强。 根据我们的实验分析, 我们开发了一个新的 Rust 图书馆, 用于快速匹配 DAACs, 命名为\ emphper{Dabormabal abs abas) 。