The run-length compressed Burrows-Wheeler transform (RLBWT) used in conjunction with the backward search introduced in the FM index is the centerpiece of most compressed indexes working on highly-repetitive data sets like biological sequences. Compared to grammar indexes, the size of the RLBWT is often much bigger, but queries like counting the occurrences of long patterns can be done much faster than on any existing grammar index so far. In this paper, we combine the virtues of a grammar with the RLBWT by building the RLBWT on top of a special grammar based on induced suffix sorting. Our experiments reveal that our hybrid approach outperforms the classic RLBWT with respect to the index sizes, and with respect to query times on biological data sets for sufficiently long patterns.
翻译:与调频索引中引入的后向搜索一起使用的长长压缩Burrows-Wheeler变形(RLBWT)是大多数最压缩的指数的中心,它们围绕生物序列等高重复性数据集工作。与语法索引相比,RLBWT的大小往往大得多,但像计算长型数字一样的查询可以比迄今为止任何现有的语法索引快得多。在本文中,我们把语法的优点与RLBWT结合起来,在基于诱导后缀排序的特殊语法之上建立RLBWT。我们的实验表明,我们的混合方法在指数大小方面超过了经典的RLBWT,而在生物数据集的查询时间方面则足够长。