The Burrows-Wheeler Transform (BWT) is a reversible string transformation that plays a central role in text compression and is fundamental to many modern bioinformatics applications. The BWT of a string is a permutation of its characters that is generally more compressible than the original string and supports several types of queries more efficiently. It is easy to see that not every string is a BWT image, and exact characterizations of BWT images are known. We investigate a related combinatorial question. In many applications, a sentinel character dollar is appended to mark the end of the string, so the BWT of a string ending with dollar contains exactly one dollar character. Given a string w, we ask in which positions, if any, the dollar character can be inserted to turn w into the BWT image of a word ending with dollar. We show that this depends only on the standard permutation of w and present an O(n log n)-time algorithm for identifying all such positions, improving on the naive quadratic-time algorithm. We also give a combinatorial characterization of such positions and develop bounds on their number and value. This is an extended version of [Giuliani et al. ICTCS 2019].
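To make the question concrete, the following is a minimal sketch of the naive quadratic-time baseline, not the O(n log n) algorithm of the paper. It assumes the known characterization that a string containing exactly one sentinel is the BWT image of a word ending with the sentinel if and only if its standard permutation (the stable sort of positions by character) consists of a single cycle. The function names, the use of `"$"` as the sentinel, and the 0-based convention (index i means the sentinel is inserted before `w[i]`) are illustrative choices, not taken from the paper.

```python
def standard_permutation(v):
    """Map each position of v to its rank under a stable sort by character."""
    order = sorted(range(len(v)), key=lambda i: v[i])  # sorted() is stable
    pi = [0] * len(v)
    for rank, i in enumerate(order):
        pi[i] = rank
    return pi


def is_single_cycle(pi):
    """Return True if the permutation pi consists of exactly one cycle."""
    j, steps = pi[0], 1
    while j != 0:
        j = pi[j]
        steps += 1
    return steps == len(pi)


def nice_positions(w, sentinel="$"):
    """Naive O(n^2 log n) scan: try every insertion index for the sentinel.

    Assumes the sentinel does not occur in w and is smaller than
    every character of w.
    """
    assert all(sentinel < c for c in w), "sentinel must be the smallest character"
    result = []
    for i in range(len(w) + 1):
        v = w[:i] + sentinel + w[i:]
        if is_single_cycle(standard_permutation(v)):
            result.append(i)
    return result


# "annb$aa" is the BWT of "banana$", so index 4 is one valid position.
print(nice_positions("annbaa"))
```

For `"annbaa"` the scan reports indices 4 and 6: inserting the sentinel at index 4 gives `annb$aa`, the BWT of `banana$`, and at index 6 gives `annbaa$`, the BWT of `nabana$`. Each candidate is checked independently here, which is what makes the baseline quadratic.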