The suffix array, describing the lexicographic order of suffixes of a given text, is the central data structure in string algorithms. The suffix array of a length-$n$ text uses $\Theta(n \log n)$ bits, which is prohibitive in many applications. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. For a length-$n$ text over an alphabet of size $\sigma$, these data structures use only $O(n \log \sigma)$ bits. Immediately after their discovery, they almost completely replaced plain suffix arrays in practical applications, and a race started to develop efficient construction procedures. Yet, after more than 20 years, even for $\sigma=2$, the fastest algorithm remains stuck at $O(n)$ time [Hon et al., FOCS 2003], which is slower by a $\Theta(\log n)$ factor than the lower bound of $\Omega(n / \log n)$ (following simply from the necessity to read the entire input). We break this long-standing barrier with a new data structure that takes $O(n \log \sigma)$ bits, answers suffix array queries in $O(\log^{\epsilon} n)$ time, and can be constructed in $O(n\log \sigma / \sqrt{\log n})$ time using $O(n\log \sigma)$ bits of space. Our result is based on several new insights into the recently developed notion of string synchronizing sets [STOC 2019]. In particular, compared to their previous applications, we eliminate orthogonal range queries, replacing them with new queries that we dub prefix rank and prefix selection queries. As a further demonstration of our techniques, we present a new pattern-matching index that simultaneously minimizes the construction time and the query time among all known compact indexes (i.e., those using $O(n \log \sigma)$ bits).
翻译:后缀阵列, 描述给定文本后缀的缩略图顺序, 是字符串运算法中的中央数据结构。 长- 美元文本的缩略图阵列使用$\ Theta (n\log n) 美元, 在许多应用程序中, 这会令人望而却步。 要解决这个问题, 格罗西 和 Vitter [STOC 2000], 独立地, Ferragina 和 Manzini [FOCS 2000] 引入了空间高效版本的后缀阵列, 称为压缩后缀阵列( CSA) 和调解析数据结构中。 在一个大小的缩略数中, 以美元计的文本阵列数 。 这些数据结构只使用 $(n\ log nn\ gigma) 美元 。 在实际应用程序中, 它们几乎完全取代了纯的 缩略图阵列 程序 。 然而, 20多年后, 即使是美元, 新的算法仍然以 $O (n) 来, 来, 新的解算算算 新的 。