Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings, however, since this introduction, it has been used more generally to describe any BWT of a collection of strings and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6x on the second best method. The peak memory is at most 2x larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1x improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.
翻译:Mantaci et al. [TCS 2007] 定义了eBWT, 将BWT的定义扩大到字符串的集合,然而,自本导言以来,它被更普遍地用来描述任何BWT的字符串集合和原始定义的基本属性(即不受输入顺序的限制)经常被忽略。在本文件中,我们为最初eBWT的构建提出了一个简单的线性时间算法,而最初eBBWT不需要Bannai et al.[CPPM 2021]。作为副产品,我们获得了计算BWT的单个字符串的首次线性时间算法,而BWT既不使用断符号,也不使用Lyndon的旋转。我们的新eBWT的构造与不使用前缀的拼法的变异异,以便可以缩放 eBWT。我们评估了关于人类染色体19号、Salmoniella和SAS-CV2基因组的算法, 显示这是所有收藏的最快方法, 最高级的SBWT/CVx 样本是我们最先进的S 7.6的第二可获取的存储式。