Statistical summaries of unlabelled evolutionary trees and ranked hierarchical clustering trees

Rooted and ranked binary trees are mathematical objects of great importance, used to model hierarchical data and evolutionary relationships, with applications in many fields, including evolutionary biology and genetic epidemiology. Bayesian phylogenetic inference usually explores the posterior distribution of trees via Markov chain Monte Carlo methods; however, assessing uncertainty and summarizing distributions or samples of such trees remains challenging. While labelled phylogenetic trees have been extensively studied, comparatively little literature exists for unlabelled trees, which are increasingly useful, for example when one seeks to summarize samples of trees obtained with different methods, or from different samples and environments, and wishes to assess the stability and generalizability of these summaries. In this paper, we exploit recently proposed distance metrics on unlabelled ranked binary trees and unlabelled ranked genealogies (equipped with branch lengths) to define the Fréchet mean and variance as summaries of these tree distributions. We provide an efficient combinatorial optimization algorithm for computing the Fréchet mean from a sample of, or a distribution on, unlabelled ranked tree shapes and unlabelled ranked genealogies. We show the applicability of our summary statistics for studying popular tree distributions and for comparing SARS-CoV-2 evolutionary trees across different locations during the COVID-19 epidemic in 2020.
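
To make the two summaries concrete: for a sample x_1, ..., x_n and a metric d, a Fréchet mean is a point minimizing the average squared distance (1/n) Σ d(x, x_i)², and the Fréchet variance is that minimum value. The sketch below is only an illustration under stated assumptions. It searches over the sample itself (a medoid-style simplification) and is not the combinatorial optimization algorithm of the paper, which searches the full tree space. The tuple encoding of trees, the `l1` distance, and the name `frechet_summary` are hypothetical stand-ins, not the paper's tree metrics or code.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def frechet_summary(
    sample: Sequence[T], dist: Callable[[T, T], float]
) -> Tuple[T, float]:
    """Return (restricted Frechet mean, Frechet variance) of a sample.

    Assumption: candidates are restricted to the sample points, so the
    result is a medoid under squared distance, not a search over the
    whole space of trees.
    """
    if not sample:
        raise ValueError("sample must be non-empty")
    best, best_var = sample[0], float("inf")
    for candidate in sample:
        # Average squared distance from this candidate to every sample point.
        var = sum(dist(candidate, x) ** 2 for x in sample) / len(sample)
        if var < best_var:
            best, best_var = candidate, var
    return best, best_var


def l1(a: Sequence[int], b: Sequence[int]) -> int:
    """Stand-in metric: L1 distance between equal-length integer encodings."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical toy encodings; real inputs would be tree representations
# paired with one of the tree metrics the abstract refers to.
toy_sample = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (0, 1, 3)]
mean, variance = frechet_summary(toy_sample, l1)
print(mean, variance)
```

Any symmetric, non-negative distance function can be passed in place of `l1`, so the same loop applies unchanged once tree encodings and a tree metric are available. The exhaustive search costs O(n²) distance evaluations for a sample of size n.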
