在建造压缩后缀阵列中打破 $O(n)$- barrier (Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays)

The suffix array, describing the lexicographic order of suffixes of a given text, is the central data structure in string algorithms. The suffix array of a length-$n$ text uses $\Theta(n \log n)$ bits, which is prohibitive in many applications. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. For a length-$n$ text over an alphabet of size $\sigma$, these data structures use only $O(n \log \sigma)$ bits. Immediately after their discovery, they almost completely replaced plain suffix arrays in practical applications, and a race started to develop efficient construction procedures. Yet, after more than 20 years, even for $\sigma=2$, the fastest algorithm remains stuck at $O(n)$ time [Hon et al., FOCS 2003], which is slower by a $\Theta(\log n)$ factor than the lower bound of $\Omega(n / \log n)$ (following simply from the necessity to read the entire input). We break this long-standing barrier with a new data structure that takes $O(n \log \sigma)$ bits, answers suffix array queries in $O(\log^{\epsilon} n)$ time, and can be constructed in $O(n\log \sigma / \sqrt{\log n})$ time using $O(n\log \sigma)$ bits of space. Our result is based on several new insights into the recently developed notion of string synchronizing sets [STOC 2019]. In particular, compared to their previous applications, we eliminate orthogonal range queries, replacing them with new queries that we dub prefix rank and prefix selection queries. As a further demonstration of our techniques, we present a new pattern-matching index that simultaneously minimizes the construction time and the query time among all known compact indexes (i.e., those using $O(n \log \sigma)$ bits).

翻译：后缀阵列, 描述给定文本后缀的缩略图顺序, 是字符串运算法中的中央数据结构。长- 美元文本的缩略图阵列使用$\ Theta (n\log n) 美元, 在许多应用程序中, 这会令人望而却步。要解决这个问题, 格罗西和 Vitter [STOC 2000], 独立地, Ferragina 和 Manzini [FOCS 2000] 引入了空间高效版本的后缀阵列, 称为压缩后缀阵列( CSA) 和调解析数据结构中。在一个大小的缩略数中, 以美元计的文本阵列数。这些数据结构只使用 $(n\ log nn\ gigma) 美元。在实际应用程序中, 它们几乎完全取代了纯的缩略图阵列程序。然而, 20多年后, 即使是美元, 新的算法仍然以 $O (n) 来, 来, 新的解算算算新的。

相关内容

FOCS

关注 0

IEEE计算机科学基础研讨会（FOCS）是由IEEE计算机学会计算数学基础技术委员会（TCMF）主办的旗舰会议，涵盖了广泛的理论计算机科学。它每年秋季举行，并与每年春季举行的由ACM SIGACT赞助的姊妹会议——计算理论年度研讨会（STOC）配对。官网链接：http://ieee-focs.org/