Approximating LCS and Alignment Distance over Multiple Sequences

We study the problem of aligning multiple sequences, where the goal is to find an alignment that either maximizes the number of aligned symbols (the longest common subsequence, LCS) or minimizes the number of unaligned symbols (the alignment distance, AD). Multiple sequence alignment is a well-studied problem in bioinformatics, used to identify regions of similarity among DNA, RNA, or protein sequences and thereby detect functional, structural, or evolutionary relationships among them. It is known that exact computation of the LCS or AD of $m$ sequences, each of length $n$, requires $\Theta(n^m)$ time unless the Strong Exponential Time Hypothesis is false. In this paper, we provide several results on approximating the LCS and AD of multiple sequences. If the LCS of $m$ sequences, each of length $n$, is $\lambda n$ for some $\lambda \in [0,1]$, then in $\tilde{O}_m(n^{\lfloor\frac{m}{2}\rfloor+1})$ time we can return a common subsequence of length at least $\frac{\lambda^2 n}{2+\epsilon}$ for any constant $\epsilon >0$. It is possible to approximate the AD within a factor of two in $\tilde{O}_m(n^{\lceil\frac{m}{2}\rceil})$ time. However, achieving an approximation factor below 2 requires breaking the triangle inequality barrier, which is a major challenge in this area. No below-2 approximation algorithm with a running time of $O(n^{\alpha m})$ for any $\alpha < 1$ is known. If the AD is $\theta n$, we design an algorithm that approximates the AD within a factor of $\left(2-\frac{3\theta}{16}+\epsilon\right)$ in $\tilde{O}_m(n^{\lfloor\frac{m}{2}\rfloor+2})$ time. Thus, if $\theta$ is a constant, we get a below-2 approximation in $\tilde{O}_m(n^{\lfloor\frac{m}{2}\rfloor+2})$ time. Moreover, we show that if just one of the $m$ sequences is $(p,B)$-pseudorandom, then we get a below-2 approximation in $\tilde{O}_m(nB^{m-1}+n^{\lfloor \frac{m}{2}\rfloor+3})$ time irrespective of $\theta$.
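To make the two objectives concrete, the following is a minimal sketch of the exact baseline that the approximation results are measured against. It is not one of the paper's algorithms. It runs a textbook dynamic program over tuples of prefix lengths, which visits on the order of $n^m$ states, and it assumes the AD counts unaligned symbols summed over all $m$ sequences (total length minus $m$ times the LCS length); that normalization is an assumption made for illustration. The function names `multi_lcs_length` and `alignment_distance` are hypothetical.

```python
from functools import lru_cache
from typing import Sequence, Tuple


def multi_lcs_length(seqs: Sequence[str]) -> int:
    """Exact LCS length of all sequences, via DP over prefix-length tuples.

    Illustrative baseline only: the state space has about n^m entries and the
    recursion depth grows with the total input length, so keep inputs small.
    """
    if not seqs:
        return 0
    m = len(seqs)

    @lru_cache(maxsize=None)
    def solve(idx: Tuple[int, ...]) -> int:
        # idx[k] is the length of the prefix of seqs[k] under consideration.
        if any(i == 0 for i in idx):
            return 0
        last = [seqs[k][idx[k] - 1] for k in range(m)]
        if all(c == last[0] for c in last):
            # All prefixes end in the same symbol: aligning it is optimal.
            return 1 + solve(tuple(i - 1 for i in idx))
        # Otherwise at least one last symbol is unaligned; try dropping each.
        return max(
            solve(idx[:k] + (idx[k] - 1,) + idx[k + 1:]) for k in range(m)
        )

    return solve(tuple(len(s) for s in seqs))


def alignment_distance(seqs: Sequence[str]) -> int:
    """Unaligned symbols summed over all sequences (assumed normalization)."""
    return sum(len(s) for s in seqs) - len(seqs) * multi_lcs_length(seqs)


if __name__ == "__main__":
    example = ["ACGT", "AGT", "ACT"]
    print(multi_lcs_length(example))    # 2  (common subsequence "AT")
    print(alignment_distance(example))  # 4  (C, G in ACGT; G in AGT; C in ACT)
```

Under this convention, maximizing the LCS and minimizing the AD select the same optimal alignment, but the two objectives behave differently under approximation, which is why the abstract states separate guarantees for each.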
