Indexing very large collections of strings, such as those produced by the widespread next generation sequencing technologies, heavily relies on multistring generalization of the Burrows-Wheeler Transform (BWT): large requirements of in-memory approaches have stimulated recent developments on external memory algorithms. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental to compute the suffix-prefix overlaps among strings, which is an essential step for many genome assembly algorithms. In a previous paper, we presented an in-memory divide-and-conquer method for building the BWT and LCP where we merge partial BWTs with a forward approach to sort suffixes. In this paper, we propose an alternative backward strategy to develop an external memory method to simultaneously build the BWT and the LCP array on a collection of m strings of different lengths. The algorithm over a set of strings having constant length k has O(mkl) time and I/O volume, using O(k + m) main memory, where l is the maximum value in the LCP array.
翻译:将非常庞大的字符串收集成索引,例如由广泛的下一代测序技术生成的字符串收集,严重依赖Burrows-Wheeler变换(BWT)的多弦化概括化:大量模拟方法的要求刺激了外部记忆算法的近期发展。计算一组字符串中最长的常见前缀(LCP)阵列的相关问题有助于计算字符串之间的后缀-前缀重叠,这是许多基因组组组组算法的一个基本步骤。在上一份论文中,我们展示了一种构建 BWT 和 LCP 的模擬分解法。 我们用前方方法将部分 BWT 和 后缀法合并了。 在本文中,我们提出了另一种后向战略,以开发一种外部记忆方法,同时在一系列不同长度的 mrings 上建立 BWT 和 LCP 阵列。 一组字符串的算法有O( mkl) 时间和 I/O 卷, 使用 O (k + m) 主内存, 其中I 是 LCP 阵列中的最大值 。