Inferring the infinitesimal rates of continuous-time Markov chains (CTMCs) is a central challenge in many scientific domains. This task is hindered by three factors: quadratic growth in the number of rates as the CTMC state space expands, strong dependencies among rates, and incomplete information for many transitions. We introduce a new Bayesian framework that flexibly models the CTMC rates by incorporating covariates through Gaussian processes (GPs). This approach improves inference by integrating new information and contributes to the understanding of the CTMC's stochastic behavior by shedding light on potential external drivers. Unlike previous approaches limited to linear covariate effects, our method captures complex non-linear relationships, enabling fuller use of covariate information and more accurate characterization of its influence. To perform efficient inference, we employ a scalable Hamiltonian Monte Carlo (HMC) sampler. We address the prohibitive cost of computing the exact likelihood gradient by combining the HMC trajectories with a scalable gradient approximation, reducing the computational complexity from $O(K^5)$ to $O(K^2)$, where $K$ is the number of CTMC states. Finally, we demonstrate our method on Bayesian phylogeographic inference, a domain where CTMCs are central, and show its effectiveness on both synthetic and real datasets.
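
The abstract does not spell out the model's exact form, so the following is only a minimal sketch of the general idea of GP-modeled rates, not the authors' implementation. It assumes a single hypothetical pairwise covariate, a squared-exponential kernel, and a log link between the GP draw and each off-diagonal rate; the names `rbf_kernel` and `sample_rate_matrix` are illustrative. It also makes the quadratic growth concrete: a chain with $K$ states has $K(K-1)$ off-diagonal rates.

```python
import numpy as np

def rbf_kernel(x, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix for 1-D covariate values x."""
    d = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_rate_matrix(covariate, rng, lengthscale=1.0, jitter=1e-6):
    """Draw one K x K CTMC rate matrix whose log-rates are a GP draw.

    covariate: (K, K) array; entry (i, j) is a hypothetical predictor
    attached to the transition i -> j (an assumption for illustration).
    """
    K = covariate.shape[0]
    off = ~np.eye(K, dtype=bool)      # mask selecting the K*(K-1) rates
    x = covariate[off]                # one covariate value per rate
    # Jitter keeps the kernel matrix numerically positive definite.
    cov = rbf_kernel(x, lengthscale) + jitter * np.eye(x.size)
    f = np.linalg.cholesky(cov) @ rng.standard_normal(x.size)
    Q = np.zeros((K, K))
    Q[off] = np.exp(f)                # log link keeps every rate positive
    Q[np.diag_indices(K)] = -Q.sum(axis=1)  # rows of a generator sum to 0
    return Q

rng = np.random.default_rng(0)
covariate = rng.uniform(0.0, 3.0, size=(4, 4))
Q = sample_rate_matrix(covariate, rng)
print(Q.shape, int((~np.eye(4, dtype=bool)).sum()))  # 4 states, 12 rates
```

Because the GP is placed on the covariate values rather than on the rates directly, transitions with similar covariates receive similar rates, which is one way a non-linear covariate effect can share information across sparsely observed transitions.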