Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. Because these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. We therefore also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of both sampling-based inference (Bayesian inference via HMC) and maximization-based inference (MAP estimation) under random-effects substitution models across large trees and state spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is more adequate than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences among 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. On a dataset of 28 taxa spanning the Metazoa, a random-effects amino acid substitution model finds evidence of notable departures from the current best-fit amino acid model in seconds. We show that our gradient-based inference approach is over an order of magnitude more time-efficient than conventional approaches.
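To make the idea of extending a base model with random effects concrete, here is a minimal sketch in Python. It assumes the random effects act multiplicatively on the off-diagonal rates of an HKY model (additively on the log scale); the abstract does not state the parameterization, so this form, along with every function and variable name below, is an illustrative assumption rather than the authors' implementation. The sketch also checks Kolmogorov's cycle criterion on 3-cycles: any nonzero gap implies the resulting chain is nonreversible.

```python
import math
from itertools import permutations

NUCLEOTIDES = "ACGT"
# Purine <-> purine and pyrimidine <-> pyrimidine changes (transitions).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def hky_rate(i, j, kappa, freqs):
    """Unnormalized off-diagonal HKY rate from state i to state j."""
    rate = freqs[j]
    if (i, j) in TRANSITIONS:
        rate *= kappa
    return rate


def random_effects_rate_matrix(kappa, freqs, effects):
    """Build Q with q_ij = q_ij(HKY) * exp(effects[(i, j)]).

    ASSUMPTION: the log-linear form is illustrative only. Missing entries
    in `effects` default to 0, which recovers the plain HKY rate.
    """
    q = {}
    for i in NUCLEOTIDES:
        row_sum = 0.0
        for j in NUCLEOTIDES:
            if i == j:
                continue
            rate = hky_rate(i, j, kappa, freqs) * math.exp(effects.get((i, j), 0.0))
            q[(i, j)] = rate
            row_sum += rate
        q[(i, i)] = -row_sum  # rows of a CTMC rate matrix sum to zero
    return q


def three_cycle_gap(q):
    """Largest |q_ij q_jk q_ki - q_ik q_kj q_ji| over distinct triples.

    A reversible chain satisfies Kolmogorov's criterion on every cycle,
    so a nonzero gap here implies nonreversibility.
    """
    return max(
        abs(q[(i, j)] * q[(j, k)] * q[(k, i)] - q[(i, k)] * q[(k, j)] * q[(j, i)])
        for i, j, k in permutations(NUCLEOTIDES, 3)
    )


freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
q_base = random_effects_rate_matrix(2.0, freqs, effects={})
q_random = random_effects_rate_matrix(2.0, freqs, effects={("A", "G"): 0.5})
```

With all effects at zero, `three_cycle_gap(q_base)` is zero up to floating-point error, as expected for the reversible HKY model. A single nonzero effect on the A-to-G rate already breaks the cycle condition in `q_random`. In this parameterization a 4-state model has 12 off-diagonal random effects, which hints at why the parameter count grows quickly with larger state spaces.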