The Longest Common Subsequence (LCS) is a fundamental string similarity measure, and computing the LCS of two strings is a classic algorithms question. A textbook dynamic programming algorithm gives an exact algorithm in quadratic time, and this is essentially best possible under plausible fine-grained complexity assumptions, so a natural problem is to find faster approximation algorithms. When the inputs are two binary strings, there is a simple $\frac{1}{2}$-approximation in linear time: compute the longest common all-0s or all-1s subsequence. It has been open whether a better approximation is possible even in truly subquadratic time. Rubinstein and Song showed that the answer is yes under the assumption that the two input strings have equal lengths. We settle the question, generalizing their result to unequal length strings, proving that, for any $\varepsilon>0$, there exists $\delta>0$ and a $(\frac{1}{2}+\delta)$-approximation algorithm for binary LCS that runs in $n^{1+\varepsilon}$ time. As a consequence of our result and a result of Akmal and Vassilevska-Williams, for any $\varepsilon>0$, there exists a $(\frac{1}{q}+\delta)$-approximation for LCS over $q$-ary strings in $n^{1+\varepsilon}$ time. Our techniques build on the recent work of Guruswami, He, and Li who proved new bounds for error-correcting codes tolerating deletion errors. They prove a combinatorial "structure lemma" for strings which classifies them according to their oscillation patterns. We prove and use an algorithmic generalization of this structure lemma, which may be of independent interest.
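To make the two baselines mentioned above concrete, here is a minimal Python sketch of the textbook quadratic dynamic program and of the linear-time $\frac{1}{2}$-approximation for binary strings. The function names `lcs_exact` and `lcs_half_approx` are illustrative assumptions rather than names from the paper, and the sketch does not implement the paper's $(\frac{1}{2}+\delta)$-approximation algorithm.

```python
def lcs_exact(x: str, y: str) -> int:
    """Textbook dynamic program: exact LCS length in O(len(x) * len(y)) time."""
    # prev[j] holds the LCS length of the processed prefix of x and y[:j].
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_half_approx(x: str, y: str) -> int:
    """Length of the longest common all-0s or all-1s subsequence (linear time).

    Assumes x and y are strings over the alphabet {"0", "1"}.
    """
    common_zeros = min(x.count("0"), y.count("0"))
    common_ones = min(x.count("1"), y.count("1"))
    return max(common_zeros, common_ones)


if __name__ == "__main__":
    x, y = "0101", "1010"
    print(lcs_exact(x, y), lcs_half_approx(x, y))
```

The approximation guarantee follows because any common subsequence contains at most `common_zeros` zeros and at most `common_ones` ones, so the LCS length is at most `common_zeros + common_ones`, which is at most twice the value returned by `lcs_half_approx`.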