Ancestral sequence reconstruction is a key task in computational biology. It consists of inferring a molecular sequence at an ancestral species of a known phylogeny, given descendant sequences at the tips of the tree. In addition to its many biological applications, it has played a key role in elucidating the statistical performance of phylogeny estimation methods. Here we establish a formal connection to another important bioinformatics problem, multiple sequence alignment, where one attempts to best align a collection of molecular sequences under some mismatch penalty score by inserting gaps. Our result is counter-intuitive: we show that, with high probability, perfect pairwise sequence alignment is possible in principle at arbitrarily large evolutionary distances, provided the phylogeny is known and dense enough. We use techniques from ancestral sequence reconstruction in the taxon-rich setting together with the probabilistic analysis of sequence evolution models involving insertions and deletions.