Text indexing is a classical algorithmic problem that has been studied for over four decades: given a text $T$, pre-process it off-line so that, later, we can quickly count and locate the occurrences of any string (the query pattern) in $T$ in time proportional to the query's length. The earliest optimal-time solution to the problem, the suffix tree, dates back to 1973 and requires up to two orders of magnitude more space than the plain text just to be stored. In the year 2000, two breakthrough works showed that efficient queries can be achieved without this space overhead: a fast index be stored in a space proportional to the text's entropy. These contributions had an enormous impact in bioinformatics: nowadays, virtually any DNA aligner employs compressed indexes. Recent trends considered more powerful compression schemes (dictionary compressors) and generalizations of the problem to labeled graphs: after all, texts can be viewed as labeled directed paths. In turn, since finite state automata can be considered as a particular case of labeled graphs, these findings created a bridge between the fields of compressed indexing and regular language theory, ultimately allowing to index regular languages and promising to shed new light on problems such as regular expression matching. This survey is a gentle introduction to the main landmarks of the fascinating journey that took us from suffix trees to today's compressed indexes for labeled graphs and regular languages.
翻译:文本索引是一个典型的算法问题, 已经研究了40多年了 : 给一个文本 $T$, 预处理它脱线, 以便我们可以在以后快速计算和定位任何字符串( 查询模式) 发生的时间与查询长度成正比 $T 。 最早最优化的压缩方案( 后缀树) 追溯到1973年, 并且需要比刚刚存储的普通文本多两个数量级的空间。 2000年, 两项突破性工程显示, 没有这个空间管理, 就能实现高效查询 : 快速索引存储在与文本的导线成比例的空间中 。 这些结果在生物信息学中产生了巨大影响: 如今, 几乎所有的DNA校对器都使用压缩的索引。 最近的趋势认为, 最有力的压缩方案( 字典压缩的折叠加器), 以及问题的一般化为标签图表: 毕竟, 文本可以被视为标签直线路径的路径。 反过来说, 有限的自动图可以被视为一个特定的例子, 这些发现在常规语言的缩缩缩缩缩缩索引和直径直径图中, 使得常规的直径直径直径的直径直径直到直到直径直径直径直到直走向方向的图像的图像的图的路径的路径的图 。