It is known that the exact form of the Burrows-Wheeler Transform (BWT) of a string collection depends, in most implementations, on the input order of the strings in the collection. Reordering the strings of an input collection affects the number of equal-letter runs $r$, arguably the most important parameter of BWT-based data structures such as the FM-index or the $r$-index. Bentley, Gibney, and Thankachan [ESA 2020] introduced a linear-time algorithm for computing the permutation of the input collection that yields the minimum number of runs in the resulting BWT. In this paper, we present the first tool that guarantees a Burrows-Wheeler Transform with the minimum number of runs (optBWT), by combining i) an algorithm that builds the BWT from a string collection (either SAIS-based [Cenzato et al., SPIRE 2021] or BCR [Bauer et al., CPM 2011]); ii) the SAP array data structure introduced in [Cox et al., Bioinformatics, 2012]; and iii) the algorithm by Bentley et al. We present results on both real-life and simulated data, showing that the improvement in $r$ with respect to the input order is significant and that the overhead of computing the optimal BWT is negligible, making our tool competitive with other tools for BWT computation in terms of running time and space usage. In particular, on real data the optBWT obtains up to 31 times fewer runs with only a $1.39\times$ slowdown. Source code is available at https://github.com/davidecenzato/optimalBWT.git.
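To make the dependence of $r$ on the input order concrete, the following Python sketch builds the BWT of a tiny collection naively and counts its runs for every permutation. It is an illustration only: it assumes each string ends with its own terminator, ordered by input position and smaller than every letter, and it searches all permutations by brute force. It is not the linear-time algorithm of Bentley et al., nor the optBWT tool, and the function names are hypothetical.

```python
from itertools import groupby, permutations


def collection_bwt(strings):
    """Naive BWT of a string collection (illustrative, quadratic-time sketch).

    Assumption: string i ends with its own terminator $_i, where
    $_i < $_j for i < j and every terminator is smaller than any letter.
    All terminators are written as '$' in the output.
    """
    rows = []
    for i, s in enumerate(strings):
        for j in range(len(s) + 1):
            # Sort key: the letters of the suffix, then the terminator of string i.
            key = [(1, ord(c)) for c in s[j:]] + [(0, i)]
            # The character preceding the suffix; '$' at the start of a string.
            prev = s[j - 1] if j > 0 else "$"
            rows.append((key, prev))
    rows.sort(key=lambda row: row[0])
    return "".join(prev for _, prev in rows)


def count_runs(text):
    """Number of maximal equal-letter runs, i.e. the parameter r."""
    return sum(1 for _ in groupby(text))


def min_runs_brute_force(strings):
    """Try every input order; feasible only for very small collections."""
    return min(
        (count_runs(collection_bwt(list(order))), list(order))
        for order in permutations(strings)
    )


if __name__ == "__main__":
    collection = ["AC", "TC", "AC"]
    bwt = collection_bwt(collection)
    print(bwt, count_runs(bwt))              # CCC$$ATA$ 6
    print(min_runs_brute_force(collection))  # (5, ['AC', 'AC', 'TC'])
```

In this toy collection, the three suffixes `C$` are distinguished only by their terminators, so the input order decides whether the preceding letters appear as `ATA` or `AAT`. Placing the two copies of `AC` next to each other saves one run.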