DNA sequencing, especially of microbial genomes and metagenomes, has been at the core of recent research advances in large-scale comparative genomics. The data deluge has resulted in exponential growth in genomic datasets over the past years and has shown no sign of slowing down. Several recent attempts have been made to tame the computational burden of sequence search on these terabyte and petabyte-scale datasets, including raw reads and assembled genomes. However, no known implementation provides both fast query and construction time, keeps the low false-positive requirement, and offers cheap storage of the data structure. We propose a data structure for search called RAMBO (Repeated And Merged BloOm Filter) which is significantly faster in query time than state-of-the-art genome indexing methods- COBS (Compact bit-sliced signature index), Sequence Bloom Trees, HowDeSBT, and SSBT. Furthermore, it supports insertion and query process parallelism, cheap updates for streaming inputs, has a zero false-negative rate, a low false-positive rate, and a small index size. RAMBO converts the search problem into set membership testing among $K$ documents. Interestingly, it is a count-min sketch type arrangement of a membership testing utility (Bloom Filter in our case). The simplicity of the algorithm and embarrassingly parallel architecture allows us to stream and index a 170TB whole-genome sequence dataset in a mere 9 hours on a cluster of 100 nodes while competing methods require weeks.
翻译:DNA测序,特别是微生物基因组和基因组的DNA测序,一直是最近大规模比较基因组研究进展的核心,而数据流则成为了最近大规模比较基因组研究进展的核心。数据流在过去几年中导致基因组数据集的指数增长,没有减缓的迹象。最近几次尝试将测序的计算负担,包括原始读数和组装基因组。然而,没有已知的实施既提供了快速查询和计算时间,也保持了较低的假阳性要求,并且提供了数据结构的廉价储存。我们提议了一个数据结构,用于称为RAMBO(重现和合并的Bloom过滤器)的搜索,在查询时间里比最先进的基因组索引方法-COBSB(复合点授权签名索引指数)、序列布线树、SHOSBT和SSBT的测序。此外,它支持插入和查询过程的平行,对流动输入的廉价更新,具有零假反偏差率、低正率率率和合并的BO(RM),同时将一个最先进的数据序列序列结构测试,而一个更精确的SB类的SB类的系统测试需要一个小的系统。