The FM-index is an important data structure in combinatorial pattern matching and bioinformatics that has been generalized from indexing single strings to indexing collections of strings, labelled trees, de Bruijn graphs and Wheeler graphs. To generalize the FM-index to collections of strings and to labelled trees, researchers generalized the Burrows-Wheeler Transform (BWT) to the Extended BWT (EBWT) and the eXtended BWT (XBWT), respectively. Although one of the EBWT's main applications is compressing and indexing DNA readsets, we show in this paper that when the reads have been assembled or align well to a reference genome, the assembled or reference genome can be used to produce a smaller compressed index. To do this, we graft the reads onto the genome and store the resulting labelled tree with the XBWT. For an {\it E.\ coli} readset, for example, our experiments show that eliminating separator characters from the EBWT reduces the number of runs by 16%, from 105.3 million to 88.3 million, and using the XBWT reduces it by a further 8.3%, to 80.9 million.
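To make the "number of runs" measure concrete, the following is a minimal sketch of the plain single-string BWT and its run count, not of the EBWT or XBWT constructions used in the paper. The function names `bwt` and `count_runs`, the `$` sentinel, and the toy input are illustrative assumptions, and the sorted-rotations construction is chosen for readability rather than efficiency.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Return the BWT of `text` by sorting all rotations.

    Illustrative only: this takes quadratic space and is unsuitable
    for real readsets. The sentinel is assumed to be absent from the
    text and to sort before every other character.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Count maximal runs of equal consecutive characters in `s`."""
    return sum(1 for _ in groupby(s))


if __name__ == "__main__":
    toy = "GATTACAGATTACA"  # hypothetical toy sequence, not real data
    transformed = bwt(toy)
    print(transformed, count_runs(transformed))
```

Run-length compressed indexes use space that grows with this run count, which is why the abstract reports its results as a number of runs.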