Wheeler DFAs (WDFAs) are a sub-class of finite-state automata which is playing an important role in the emerging field of compressed data structures: as opposed to general automata, WDFAs can be stored in just $\log\sigma + O(1)$ bits per edge, $\sigma$ being the alphabet's size, and support optimal-time pattern matching queries on the substring closure of the language they recognize. An important step to achieve further compression is minimization. When the input $\mathcal A$ is a general deterministic finite-state automaton (DFA), the state-of-the-art is represented by the classic Hopcroft's algorithm, which runs in $O(|\mathcal A|\log |\mathcal A|)$ time. This algorithm stands at the core of the only existing minimization algorithm for Wheeler DFAs, which inherits its complexity. In this work, we show that the minimum WDFA equivalent to a given input WDFA can be computed in linear $O(|\mathcal A|)$ time. When run on de Bruijn WDFAs built from real DNA datasets, an implementation of our algorithm reduces the number of nodes from 14% to 51% at a speed of more than 1 million nodes per second.
翻译:Wheeler DFAs (WDFAs) 是一个在压缩数据结构的新兴领域发挥重要作用的有限状态自动自动数据小分类: 与一般自动数据相比, WDFA 可以用纯$\log\sigma + O(1)美元比特/ 边缘存储, $\sigma$是字母的大小, 支持对所识别语言的子字符串关闭进行最佳时间匹配查询。 进一步压缩的一个重要步骤是最小化。 当输入 $\ mathcal A$ 是一般确定性固定状态自动数据( DFA) 时, 状态由经典的Hopcroft 算法代表, 以$( mascal Açálog + O(1)美元 美元/ mathcal A ⁇ ) 时间存储。 此算法是目前唯一最起码最短时间匹配其所识别语言的最小化算法的核心, 并在此工作中, 我们显示, 相当于给WDFA的最小值相当于WDFA的最小值可以用直线 $O (Zmacal A\) $ 51 a de de de max more time