In Gapped String Indexing, the goal is to compactly represent a string $S$ of length $n$ such that given queries consisting of two strings $P_1$ and $P_2$, called patterns, and an integer interval $[\alpha, \beta]$, called the gap range, we can quickly find occurrences of $P_1$ and $P_2$ in $S$ with distance in $[\alpha, \beta]$. Due to the many applications of this fundamental problem in computational biology and elsewhere, there is a large body of work on restricted or parameterised variants of the problem. However, for the general problem statement, no improvements upon the trivial $\mathcal{O}(n)$-space $\mathcal{O}(n)$-query time or $\Omega(n^2)$-space $\tilde{\mathcal{O}}(|P_1| + |P_2| + \mathrm{occ})$-query time solutions were known so far. We break this barrier, obtaining interesting trade-offs with polynomially subquadratic space and polynomially sublinear query time. In particular, we show that, for every $0\leq \delta \leq 1$, there is a data structure for Gapped String Indexing with either $\tilde{\mathcal{O}}(n^{2-\delta/3})$ or $\tilde{\mathcal{O}}(n^{3-2\delta})$ space and $\tilde{\mathcal{O}}(|P_1| + |P_2| + n^{\delta}\cdot (\mathrm{occ}+1))$ query time, where $\mathrm{occ}$ is the number of reported occurrences. As a new fundamental tool towards obtaining our main result, we introduce the Shifted Set Intersection problem: preprocess a collection of sets $S_1, \ldots, S_k$ of integers such that given queries consisting of three integers $i,j,s$, we can quickly output YES if and only if there exist $a \in S_i$ and $b \in S_j$ with $a+s = b$. We start by showing that the Shifted Set Intersection problem is equivalent to the indexing variant of 3SUM (3SUM Indexing) [Golovnev et al., STOC 2020]. Via several reduction steps, we then show that the Gapped String Indexing problem reduces to polylogarithmically many instances of the Shifted Set Intersection problem.
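To make the two query types concrete, the following Python sketch gives naive baselines for both problems. It is not the data structure of the paper: it does no preprocessing beyond what each query needs and only illustrates the problem statements. The function names are hypothetical, sets are indexed from 0 rather than 1, and the distance between two occurrences is assumed here to be the difference $j - i$ of their starting positions, which the abstract does not spell out.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Set, Tuple


def shifted_set_intersection(sets: Sequence[Set[int]], i: int, j: int, s: int) -> bool:
    """Naive query: YES iff some a in sets[i] and b in sets[j] satisfy a + s == b.

    Runs in time proportional to the smaller of the two sets.
    """
    if len(sets[i]) <= len(sets[j]):
        return any(a + s in sets[j] for a in sets[i])
    return any(b - s in sets[i] for b in sets[j])


def find_all(text: str, pattern: str) -> List[int]:
    """Return all starting positions of pattern in text, in increasing order."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def gapped_occurrences(
    text: str, p1: str, p2: str, alpha: int, beta: int
) -> List[Tuple[int, int]]:
    """Naive gapped query: all pairs (i, j) with p1 at i, p2 at j, alpha <= j - i <= beta.

    Assumption: distance means the difference of starting positions.
    """
    occ2 = find_all(text, p2)  # sorted, so binary search applies
    result = []
    for i in find_all(text, p1):
        lo = bisect_left(occ2, i + alpha)
        hi = bisect_right(occ2, i + beta)
        result.extend((i, j) for j in occ2[lo:hi])
    return result


if __name__ == "__main__":
    sets = [{1, 5, 9}, {4, 12}]
    print(shifted_set_intersection(sets, 0, 1, 3))
    print(gapped_occurrences("abcabcab", "ab", "c", 2, 5))
```

The gapped query scans the whole text on every call, which corresponds to the linear-time end of the trade-off; the Shifted Set Intersection query is the decision version that the reductions in the paper target.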