The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let $e_k(n)$ denote the average edit distance between random, independent strings of $n$ characters from an alphabet of size $k$. For $k \geq 2$, it is an open problem how to efficiently compute the exact value of $\alpha_{k}(n) = e_k(n)/n$ as well as of $\alpha_{k} = \lim_{n \to \infty} \alpha_{k}(n)$, a limit known to exist. This paper shows that $\alpha_k(n)-Q(n) \leq \alpha_k \leq \alpha_k(n)$, for a specific $Q(n)=\Theta(\sqrt{\log n / n})$, a result which implies that $\alpha_k$ is computable. The exact computation of $\alpha_k(n)$ is explored, leading to an algorithm running in time $T=\mathcal{O}(n^2k\min(3^n,k^n))$, a complexity that makes it of limited practical use. An analysis of statistical estimates is proposed, based on McDiarmid's inequality, showing how $\alpha_k(n)$ can be evaluated with good accuracy, high confidence level, and reasonable computation time, for values of $n$ say up to a quarter million. Correspondingly, 99.9\% confidence intervals of width approximately $10^{-2}$ are obtained for $\alpha_k$. Combinatorial arguments on edit scripts are exploited to analytically characterize an efficiently computable lower bound $\beta_k^*$ to $\alpha_k$, such that $ \lim_{k \to \infty} \beta_k^*=1$. In general, $\beta_k^* \leq \alpha_k \leq 1-1/k$; for $k$ greater than a few dozens, computing $\beta_k^*$ is much faster than generating good statistical estimates with confidence intervals of width $1-1/k-\beta_k^*$. The techniques developed in the paper yield improvements on most previously published numerical values as well as results for alphabet sizes and string lengths not reported before.
翻译:在计算生物学、语音识别和机器学习中广泛应用的字符间距差异度;让 $_k(n) 表示从一个大小的字母中随机独立的字符间距为$n美元。 对于 $k\geq 2, 如何有效计算 $\ k(n) = e_k(n) /n) 和 $\ k(h) 和 $(h) - 2美元 和 美元 的计算方法不同。让 美元 =2 (n) 表示 美元 = 美元 = 美元 = 美元 =k(k) 和 美元 = = 美元 = = 美元 = 美元 = 美元 = (h) = 美元 = = 美元 = 美元 =k) = 美元 = = 美元 = 美元 = = 美元 = = = 美元 = = 美元 = = 美元 = = = = = = 美元 = = = = = = 正在计算 = 美元 = = = = = = = = = = = = = = 美元 = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =