Bayesian Additive Regression Trees (BART) is a Bayesian nonparametric approach that has been shown to be competitive with the best modern predictive methods such as random forests and gradient boosted decision trees. The sum-of-trees structure combined with a Bayesian inferential framework provides an accurate and robust statistical method. A BART variant named SBART, which uses randomized decision trees, has been developed and shows practical benefits compared to BART. The primary bottleneck of SBART is the time required to compute the sufficient statistics, and the publicly available implementation of the SBART algorithm in the R package is very slow. In this paper we show how the SBART algorithm can be modified and computed using single program, multiple data (SPMD) distributed computation with the Message Passing Interface (MPI) library. This approach scales nearly linearly in the number of processor cores, enabling the practitioner to perform statistical inference on massive datasets. Our approach can also handle datasets too massive to fit in any single data repository. We have modified the algorithm to handle classification problems, which cannot be done with the original R package. With data experiments we show the advantage of distributed SBART for classification problems compared to BART.
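To make the SPMD idea concrete, the following is a minimal sketch (our illustration under assumed details, not the authors' implementation): each MPI rank holds a shard of the observations, computes per-leaf sufficient statistics (residual sums and counts) for a tree locally, and the partial results are combined with a single MPI_Allreduce per statistic. The leaf assignments and residuals here are toy stand-ins for what SBART's randomized decision trees would produce.

```cpp
// Sketch of SPMD sufficient-statistic computation with MPI.
// Hypothetical data layout: each rank owns n_local observations.
#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_leaves = 4;    // hypothetical tree with 4 leaves
    const int n_local  = 1000; // observations held by this rank

    // Toy local shard: residuals and leaf assignments (stand-ins for real data).
    std::vector<double> residual(n_local, 0.1 * (rank + 1));
    std::vector<int> leaf(n_local);
    for (int i = 0; i < n_local; ++i) leaf[i] = i % n_leaves;

    // Local sufficient statistics: per-leaf residual sums and counts.
    std::vector<double> local_sum(n_leaves, 0.0), global_sum(n_leaves);
    std::vector<double> local_cnt(n_leaves, 0.0), global_cnt(n_leaves);
    for (int i = 0; i < n_local; ++i) {
        local_sum[leaf[i]] += residual[i];
        local_cnt[leaf[i]] += 1.0;
    }

    // One collective per statistic: communication cost depends on the number
    // of leaves, not on the data size, which is why the approach can scale
    // nearly linearly with the number of cores.
    MPI_Allreduce(local_sum.data(), global_sum.data(), n_leaves,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(local_cnt.data(), global_cnt.data(), n_leaves,
                  MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int l = 0; l < n_leaves; ++l)
            std::printf("leaf %d: sum=%.2f count=%.0f\n",
                        l, global_sum[l], global_cnt[l]);
    }
    MPI_Finalize();
    return 0;
}
```

Because every rank ends the reduction with the same global statistics, each core can draw the same leaf-parameter updates without shipping raw observations between machines, which is what allows datasets too large for any single repository to be handled.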