Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database.
翻译:在生物信息学中,计算与文本相对应的模式统计是一项基本任务,但当文本是一个高度压缩的基因组数据库时,则是一项艰巨的任务。Bannai等人为这个案例提供了高效的解决方案,Rossi等人最近实施了这一解决方案,但它在第一个版本中对每个字符使用两次通过的模式和缓冲指示器。在本文件中,我们简化了它们的解决方案并使其流,成本是稍微放慢速度。这意味着,首先,我们可以平行地计算若干长期模式(如整个人类染色体)的匹配统计数据,同时仍然使用合理数量的内存;其次,我们可以对在线统计数据进行低延迟的匹配,从而在一种模式与数据库相比变得不可压缩时迅速识别。