Statistical inference in evolutionary models with site-dependence is a long-standing challenge in phylogenetics and computational biology. We consider the problem of approximating marginal sequence likelihoods under dependent-site models of biological sequence evolution. We prove a polynomial mixing time bound for a Markov chain Monte Carlo algorithm that samples the conditional distribution over latent sample paths, when the chain is initialized with a warm start. We then introduce a sequential Monte Carlo (SMC) algorithm for approximating the marginal likelihood, and show that our mixing time bound can be combined with recent importance sampling and finite-sample SMC results to obtain bounds on the finite sample approximation error of the resulting estimator. Our results show that the proposed SMC algorithm yields an efficient randomized approximation scheme for many practical problems of interest, and offers a significant improvement over a recently developed importance sampler for this problem. Our approach combines recent innovations in obtaining bounds for MCMC and SMC samplers, and may prove applicable to other problems of approximating marginal likelihoods and Bayes factors.
翻译:在系统发育学和计算生物学中,具有位点依赖性的进化模型的统计推断是一个长期存在的挑战。本文研究了在生物序列演化的依赖位点模型下近似边缘序列似然的问题。我们证明了一个马尔可夫链蒙特卡罗算法在从热启动初始化时,其采样潜在样本路径条件分布的混合时间具有多项式边界。随后,我们提出了一种用于近似边缘似然的序贯蒙特卡罗算法,并证明我们的混合时间边界可以与最新的重要性采样及有限样本SMC结果相结合,从而获得所生成估计器的有限样本近似误差边界。研究结果表明,所提出的SMC算法为许多实际关注问题提供了高效的随机化近似方案,相较于近期针对该问题开发的重要性采样器具有显著改进。我们的方法融合了在获取MCMC与SMC采样器边界方面的最新创新成果,可能适用于其他近似边缘似然和贝叶斯因子的相关问题。