Real-world data often comes in compressed form. Analyzing compressed data directly (without decompressing it) can save space and time by orders of magnitude. In this work, we focus on fundamental sequence comparison problems and try to quantify the gain in time complexity when the underlying data is highly compressible. We consider grammar compression, which unifies many practically relevant compression schemes. For two strings of total length $N$ and total compressed size $n$, it is known that the edit distance and a longest common subsequence (LCS) can be computed exactly in time $\tilde{O}(nN)$, as opposed to $O(N^2)$ for the uncompressed setting. Many applications need to align multiple sequences simultaneously, and the fastest known exact algorithms for median edit distance and LCS of $k$ strings run in $O(N^k)$ time. This naturally raises the question of whether compression can help to reduce the running time significantly for $k \geq 3$, perhaps to $O(N^{k/2}n^{k/2})$ or $O(Nn^{k-1})$. Unfortunately, we show lower bounds that rule out any improvement beyond $\Omega(N^{k-1}n)$ time for any of these problems assuming the Strong Exponential Time Hypothesis. At the same time, we show that approximation and compression together can be surprisingly effective. We develop an $\tilde{O}(N^{k/2}n^{k/2})$-time FPTAS for the median edit distance of $k$ sequences. In comparison, no $O(N^{k-\Omega(1)})$-time PTAS is known for the median edit distance problem in the uncompressed setting. For two strings, we get an $\tilde{O}(N^{2/3}n^{4/3})$-time FPTAS for both edit distance and LCS. In contrast, for uncompressed strings, there is not even a subquadratic algorithm for LCS that has less than a polynomial gap in the approximation factor. Building on the insight from our approximation algorithms, we also obtain results for many distance measures including the edit, Hamming, and shift distances.
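As an illustrative aside (not part of the abstract), grammar compression can be pictured as a straight-line program (SLP): each rule is either a single character or the concatenation of two earlier rules, and the compressed size $n$ is the number of rules while $N$ is the length of the derived string. A minimal Python sketch, with a hypothetical rule encoding chosen for brevity:

```python
def expand(rules, i):
    """Recursively expand rule i of an SLP into the string it derives.

    A rule is either a terminal (a one-character string) or a pair
    (left, right) of indices of earlier rules to concatenate.
    """
    rule = rules[i]
    if isinstance(rule, str):   # terminal rule
        return rule
    left, right = rule          # nonterminal: concatenation of two rules
    return expand(rules, left) + expand(rules, right)


# n = 5 rules deriving a string of length N = 8; doubling rules like
# (k, k) let N grow exponentially in n, so n can be far smaller than N.
rules = ["a", "b", (0, 1), (2, 2), (3, 3)]
print(expand(rules, 4))  # "abababab"
```

The doubling pattern above is why algorithms parameterized by both $n$ and $N$, such as the $\tilde{O}(nN)$-time compressed edit distance computation mentioned in the abstract, can beat the uncompressed $O(N^2)$ bound when the input is highly compressible.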