The multispecies coalescent process models the genealogical relationships of genes sampled from several species, enabling useful predictions about phenomena such as the discordance between the gene tree and the species phylogeny due to incomplete lineage sorting. Conversely, knowledge of large collections of gene trees can inform us about several aspects of the species phylogeny, such as its topology and ancestral population sizes. A fundamental open problem in this context is how to efficiently compute the probability of a gene tree topology, given the species phylogeny. Although a number of algorithms for this task have been proposed, they either produce approximate results, or, when they are exact, they do not scale to large data sets. In this paper, we present some progress towards exact and efficient computation of the probability of a gene tree topology. We provide a new algorithm that, given a species tree and the number of genes sampled for each species, calculates the probability that the gene tree topology will be concordant with the species tree. Moreover, we provide an algorithm that computes the probability of any specific gene tree topology concordant with the species tree. Both algorithms run in polynomial time and have been implemented in Python. Experiments show that they are able to analyse data sets where thousands of genes are sampled, in a matter of minutes to hours.
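To make the quantity concrete, the smallest non-trivial case has a well-known closed form. For a rooted three-species tree ((A,B),C) with one gene sampled per species, the gene tree topology matches the species tree with probability 1 - (2/3)e^(-t), where t is the length of the internal branch in coalescent units. The sketch below illustrates only this classical special case, not the polynomial-time algorithms presented in the paper; the function name is a hypothetical choice made for this example.

```python
import math


def concordance_probability_three_taxa(t: float) -> float:
    """Probability that a gene tree is concordant with the species tree ((A,B),C).

    Assumes one gene sampled per species and an internal branch of length t,
    measured in coalescent units. Illustrative special case only; this is not
    the general algorithm described in the paper.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    # With probability 1 - exp(-t) the A and B lineages coalesce on the
    # internal branch, which forces concordance. Otherwise all three lineages
    # enter the root population, where each of the three possible topologies
    # is equally likely, so concordance has probability 1/3.
    p_coalesce_on_branch = 1.0 - math.exp(-t)
    return p_coalesce_on_branch + (1.0 - p_coalesce_on_branch) / 3.0


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0, 5.0):
        print(f"t = {t:>4}: P(concordant) = {concordance_probability_three_taxa(t):.4f}")
```

A zero-length internal branch gives a probability of 1/3, and the probability approaches 1 as the branch lengthens, which is the incomplete lineage sorting effect mentioned above. For larger trees and multiple sampled genes per species no such simple formula applies, which is the gap the algorithms in this paper address.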