索引高重复性字符串收藏 (Indexing Highly Repetitive String Collections)

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability of handling them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them both in theoretical and practical aspects. We conclude with the current challenges in this fascinating field.

翻译：20年前,在弦收藏的索引化方面有了突破,使得能够在压缩空间内代表它们,同时提供索引搜索功能。由于这种新技术渗透到生物信息学等应用中,因此弦收藏的增长速度超过了摩尔的法律,并挑战我们甚至以压缩形式处理它们的能力。幸运的是,许多迅速增长的弦收藏高度重复,因此其信息内容数量比其普通大小要小。但是,用于古典收藏的统计压缩方法却对这种重复性视而不见,因此开发了一套新的技术,以便适当地加以利用。由此形成的索引形成了新一代的数据结构,能够处理我们现在面临的巨大的重复的弦收藏。在这次调查中,我们涵盖了导致这些数据结构的算法发展。我们描述了用来利用重复性、构成所有现有索引基础的基本算法概念的独特压缩模式,以及所提出的各种结构,在理论和实践两个方面进行比较。我们得出了这个惊人领域目前的挑战。