新方案突破构建压缩后缀数组和后缀树所需 $O(n)$ 的限制 (Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays and Suffix Trees)

The suffix array and the suffix tree are the two most fundamental data structures for string processing. For a length-$n$ text, however, they use $\Theta(n \log n)$ bits of space, which is often too costly. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. Sadakane [SODA 2002] then showed how to augment them to obtain the compressed suffix tree (CST). For a length-$n$ text over an alphabet of size $\sigma$, these structures use only $O(n\log\sigma)$ bits. The biggest remaining open question is how efficiently they can be constructed. After two decades, the fastest algorithms still run in $O(n)$ time [Hon et al., FOCS 2003], which is $\Theta(\log_{\sigma} n)$ factor away from the lower bound of $\Omega(n/\log_{\sigma}n)$. In this paper, we make the first in 20 years improvement in $n$ for this problem by proposing a new compressed suffix array and a new compressed suffix tree which admit $o(n)$-time construction algorithms while matching the space bounds and the query times of the original CSA/CST and the FM-index. More precisely, our structures take $O(n\log\sigma)$ bits, support SA queries and full suffix tree functionality in $O(\log^{\epsilon}n)$ time per operation, and can be constructed in $O(n \min(1,\log\sigma/\sqrt{\log n}))$ time using $O(n\log\sigma)$ bits of working space. We derive this result as a corollary from a much more general reduction: We prove that all parameters of a compressed suffix array/tree (query time, space, construction time, and construction working space) can essentially be reduced to those of a data structure answering new query types that we call prefix rank and prefix selection. Using the novel techniques, we also develop a new index for pattern matching.

翻译：后缀数组和后缀树是用于字符串处理的两个最基本的数据结构。然而，对于长度为 $n$ 的文本，它们使用 $\Theta(n \log n)$ 位的空间，这通常过于昂贵。为了解决这个问题，Grossi 和 Vitter [STOC 2000] 以及 Ferragina 和 Manzini [FOCS 2000] 独立地引入了后缀数组的空间高效版本，称为压缩后缀数组(CSA)和 FM 索引。随后，Sadakane [SODA 2002] 提出了如何扩展这些索引以获得压缩后缀树(CST)。对于一个大小为 $\sigma$ 的字母表的长度为 $n$ 的文本，这些结构只使用 $O(n \log \sigma)$ 位。最大的未解决问题是它们能够以多快速的速度构建。经过二十年，最快的算法仍然在 $O(n)$ 时间内运行 [Hon等人，FOCS 2003]，这与 $\Omega(n/\log_{\sigma}n)$ 的下界相距 $\Theta(\log_{\sigma} n)$。在这篇论文中，我们提出了一种新的压缩后缀数组和压缩后缀树，它们可以在匹配原始 CSA/CST 和 FM 索引的空间限制和查询时间的前提下接受 $o(n)$-时间构建算法。更具体地说，我们的结构需要 $O(n \log \sigma)$ 位，支持 SA 查询和完整的后缀树功能，每次操作的时间复杂度为 $O(\log^{\epsilon}n)$，可以在使用 $O(n\log\sigma)$ 位的工作空间中在 $O(n \min(1,\log\sigma/\sqrt{\log n}))$ 时间内构建。我们将这个结果推导为一个更为一般性的规约：将所有压缩后缀数组/树的参数（查询时间、空间、构建时间和构建工作空间）基本上缩减到了一个称为前缀等级和前缀选择的新的查询类型的参数。使用新技术，我们还开发了一种用于模式匹配的新索引。