The $r$-index (Gagie et al., JACM 2020) represented a breakthrough in compressed indexing of repetitive text collections, outperforming its alternatives by orders of magnitude. Its space usage, $\mathcal{O}(r)$ where $r$ is the number of runs in the Burrows-Wheeler Transform of the text, is, however, larger than that of Lempel-Ziv and grammar-based indexes, which makes it uninteresting in various real-life scenarios of milder repetitiveness. In this paper we introduce the $sr$-index, a variant that limits the space to $\mathcal{O}(\min(r,n/s))$ for a text of length $n$ and a given parameter $s$, at the cost of multiplying the time per reported occurrence by $s$. The $sr$-index is obtained by carefully subsampling the text positions indexed by the $r$-index, in a way that we prove still supports pattern matching with guaranteed performance. Our experiments demonstrate that the $sr$-index sharply outperforms virtually every other compressed index on repetitive texts, in both time and space, even matching the performance of the $r$-index while using 1.5--3.0 times less space. Only some Lempel-Ziv-based indexes achieve better compression than the $sr$-index, using about half the space, but they are an order of magnitude slower.
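
To make the quantities $r$, $n$ and $s$ concrete, the following minimal Python sketch computes the Burrows-Wheeler Transform of a toy text, counts its runs, and evaluates $\min(r, n/s)$. It is only an illustration under stated assumptions: the function names (`bwt`, `count_runs`, `subsampled_bound`) are hypothetical, the BWT is built naively by sorting rotations, and the bound ignores the constants hidden in the $\mathcal{O}$ notation. It does not implement the $sr$-index or its subsampling scheme.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, take the last column.

    Assumes the sentinel does not occur in text and sorts before every
    other symbol (true for '$' against ASCII letters).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal symbols; this is r when s is a BWT."""
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if a != b)


def subsampled_bound(r: int, n: int, s: int) -> int:
    """Illustrative value of min(r, n/s), ignoring asymptotic constants."""
    return min(r, n // s)


if __name__ == "__main__":
    for text in ("banana", "abababab"):
        transformed = bwt(text)
        r = count_runs(transformed)
        n = len(transformed)  # length includes the sentinel
        print(text, transformed, "r =", r, "n =", n,
              "min(r, n/s) for s=2:", subsampled_bound(r, n, 2))
```

On the highly repetitive toy text `abababab` the run count stays small relative to $n$, which is the regime where $\mathcal{O}(r)$ space pays off; for less repetitive texts $r$ approaches $n$ and the $n/s$ term becomes the one that limits the space.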