The compression of highly repetitive strings (i.e., strings containing many repeated substrings) has been a central research topic in string processing, and many compression methods for such strings have been proposed. Among them, a compression format attracting increasing attention is the run-length Burrows--Wheeler transform (RLBWT), the run-length encoding of the BWT, which is a reversible permutation of an input string determined by the lexicographic order of its suffixes. State-of-the-art construction algorithms for the RLBWT suffer from either (i) non-optimal computation time or (ii) a working space that is linear in the length of the input string. In this paper, we present \emph{r-comp}, the first optimal-time construction algorithm for the RLBWT in BWT-runs bounded space. Specifically, r-comp runs in $O(n + r \log{r})$ time and uses $O(r\log{n})$ bits of working space, where $n$ is the length of the input string and $r$ is the number of equal-letter runs in its BWT. The computation time is optimal (i.e., $O(n)$) for strings with $r=O(n/\log{n})$, since then $r\log{r} = O(n)$; this property holds for most highly repetitive strings. Experiments on a real-world dataset of highly repetitive strings show the effectiveness of r-comp in terms of computation time and space.
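To make the definitions of the BWT, the RLBWT, and the run count $r$ concrete, here is a minimal sketch in Python. It is a naive illustration only and is not r-comp: it sorts all rotations explicitly, so it uses working space at least linear in $n$, which is exactly the cost r-comp is designed to avoid. The function names `bwt` and `rlbwt` and the choice of `$` as end marker are assumptions made for this example.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: last column of the sorted rotations of text + sentinel.

    Assumes the sentinel does not occur in text and compares smaller than
    every character of text (true for "$" with letters and digits).
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    # With a unique smallest sentinel, sorting rotations orders them
    # exactly as the suffixes of s are ordered lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def rlbwt(text: str) -> list[tuple[str, int]]:
    """Run-length encode the BWT as (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(bwt(text))]


if __name__ == "__main__":
    print(bwt("banana"))         # annb$aa
    print(rlbwt("banana"))       # [('a', 1), ('n', 2), ('b', 1), ('$', 1), ('a', 2)]
    print(len(rlbwt("banana")))  # r = 5
```

The length of the list returned by `rlbwt` is the number of runs $r$ (counting the run of the end marker). For highly repetitive inputs $r$ tends to be much smaller than $n$, which is why a working space of $O(r\log{n})$ bits is attractive.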