Markov Chain Monte Carlo (MCMC) is a well-established family of algorithms primarily used in Bayesian statistics to sample from a target distribution when direct sampling is challenging. Existing work on Bayesian decision trees uses MCMC. Unfortunately, this can be slow, especially when considering large volumes of data. It is hard to parallelise the accept-reject component of the MCMC. None-the-less, we propose two methods for exploiting parallelism in the MCMC: in the first, we replace the MCMC with another numerical Bayesian approach, the Sequential Monte Carlo (SMC) sampler, which has the appealing property that it is an inherently parallel algorithm; in the second, we consider data partitioning. Both methods use multi-core processing with a HighPerformance Computing (HPC) resource. We test the two methods in various study settings to determine which method is the most beneficial for each test case. Experiments show that data partitioning has limited utility in the settings we consider and that the use of the SMC sampler can improve run-time (compared to the sequential implementation) by up to a factor of 343.
翻译:Markov Chain Monte Carlo(MCMC)是一套公认的算法,主要用于巴伊西亚统计,在直接取样具有挑战性时,从目标分布取样。关于巴伊西亚决策树的现有工作使用MCMC。 不幸的是,这可能会是缓慢的,特别是在考虑大量数据时。很难平行使用MCMC的接受-拒绝部分。无一无一无二地,我们提出在MC中利用平行法的两种方法:首先,我们用另一种数字的Bayesian方法取代MC,代之以另一个数字的Bayesian方法,即Conte Monte Carlo(SMC)取样器,该方法的特性具有吸引力,具有内在的平行算法;第二,我们考虑数据分割。两种方法都使用高性计算(HPC)资源的多极处理方法。我们在各种研究环境中测试两种方法,以确定哪种方法对每个测试案例最有利。实验表明,数据分割在我们所考虑的环境中用处有限,使用SMC取样器可以改进运行时间(相对于连续实施),最高为343。