Research on indexing repetitive string collections has focused on the same search problems used for regular string collections, although these problems can make little sense in this scenario. For example, the basic pattern matching query "list all the positions where pattern $P$ appears" can produce huge outputs when $P$ appears in an area shared by many documents, even though all those occurrences are essentially the same. In this paper we propose a new query that can be more appropriate for these collections, which we call {\em contextual pattern matching}. The basic query of this type takes, in addition to $P$, a context length $\ell$, and asks to report the occurrences of all {\em distinct} strings $XPY$, with $|X|=|Y|=\ell$. While this query is easily solved in optimal time and linear space, we focus on using space related to the repetitiveness of the text collection, and present the first solution of this kind. Letting $\ovr$ be the maximum of the number of runs in the BWT of the text $T[1..n]$ and of its reverse, our structure uses $O(\ovr\log(n/\ovr))$ space and finds the $c$ contextual occurrences $XPY$ of $(P,\ell)$ in time $O(|P| + c \log n)$. We give other space/time tradeoffs as well, for compressed and uncompressed indexes.
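To make the query concrete, the following is a minimal brute-force sketch of contextual pattern matching. It is not the index proposed in the paper: it scans the text directly and uses space linear in the text, so it illustrates only the semantics of the query. The function name `contextual_occurrences`, the 0-based positions, and the choice to skip occurrences that lack a full context of length $\ell$ on either side are assumptions made for illustration.

```python
def contextual_occurrences(text, pattern, ell):
    """Group the occurrences of `pattern` by their context string XPY.

    Returns a dict that maps each distinct string X + pattern + Y, with
    len(X) == len(Y) == ell, to the 0-based starting positions of
    `pattern` that have that context.

    Assumption (not from the paper): occurrences too close to either end
    of `text` to have a full context of length `ell` are skipped.
    """
    found = {}
    m = len(pattern)
    start = text.find(pattern)
    while start != -1:
        left = start - ell        # where X begins
        right = start + m + ell   # one past where Y ends
        if left >= 0 and right <= len(text):
            context = text[left:right]
            found.setdefault(context, []).append(start)
        # Advance by one position so overlapping occurrences are found.
        start = text.find(pattern, start + 1)
    return found


if __name__ == "__main__":
    # Two near-identical "documents" share the context of "abc".
    result = contextual_occurrences("xabcy_xabcy_zabcw", "abc", 1)
    for context, positions in result.items():
        print(context, positions)
```

Here the number of keys in the returned dictionary plays the role of $c$, the number of contextual occurrences, which can be much smaller than the total number of occurrences of $P$ when the collection is repetitive.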