We propose a general method for distributed Bayesian model choice, using the marginal likelihood, where the data set is split into non-overlapping subsets. These subsets are only accessed locally by individual workers, and no data is shared between workers. We approximate the model evidence for the full data set through Monte Carlo sampling from the posterior on each subset, yielding a model evidence per subset. The results are combined using a novel approach that corrects for the splitting using summary statistics of the generated samples. Our divide-and-conquer approach enables Bayesian model choice in the large-data setting, exploiting all available information while limiting communication between workers. We derive theoretical error bounds that quantify the resulting trade-off between computational gain and loss in precision. The embarrassingly parallel nature of the method yields substantial speed-ups on massive data sets, as illustrated by our real-world experiments. In addition, we show how the suggested approach can be extended to model choice within a reversible jump setting that explores multiple feature combinations within a single run.
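The combination step can be made concrete with a standard identity: if each subset posterior q_s(θ) ∝ π(θ)^{1/S} p(y_s | θ) is formed under the prior tempered to the power 1/S, then the full-data evidence factorizes as Z = (∏_s Z_s) · ∫ ∏_s q_s(θ) dθ, so the correction term is the integral of the product of normalized subset posteriors. The sketch below is our own illustration of this idea, not the paper's exact algorithm: it approximates each q_s by a Gaussian fitted to the subset samples (the summary statistics mentioned above), for which the correction integral has a closed form. All function names and the NumPy setup are our assumptions.

```python
import numpy as np

def log_gaussian_product_integral(mus, covs):
    """Closed-form log of the integral of prod_s N(theta; mu_s, Sigma_s).

    Writing each factor in natural parameters (Lam_s = inv(Sigma_s),
    eta_s = Lam_s @ mu_s), the product is an unnormalized Gaussian whose
    integral is available analytically.
    """
    d = len(mus[0])
    lams = [np.linalg.inv(c) for c in covs]          # precision matrices
    etas = [lam @ mu for lam, mu in zip(lams, mus)]  # precision-weighted means
    lam_tot = sum(lams)
    eta_tot = sum(etas)
    # Sum of the log normalizing constants of each factor N(theta; mu_s, Sigma_s).
    log_c = sum(
        -0.5 * mu @ lam @ mu
        - 0.5 * d * np.log(2.0 * np.pi)
        + 0.5 * np.linalg.slogdet(lam)[1]
        for lam, mu in zip(lams, mus)
    )
    # Gaussian integral of exp(-0.5 theta' Lam theta + eta' theta).
    log_c += 0.5 * (
        d * np.log(2.0 * np.pi)
        - np.linalg.slogdet(lam_tot)[1]
        + eta_tot @ np.linalg.solve(lam_tot, eta_tot)
    )
    return log_c

def combine_subset_evidences(log_zs, subset_samples):
    """Combine per-subset log evidences, each computed under the tempered
    prior pi(theta)^(1/S), using Gaussian summary statistics (mean and
    covariance) of the posterior samples drawn on each subset.
    """
    mus = [np.atleast_1d(s.mean(axis=0)) for s in subset_samples]
    covs = [np.atleast_2d(np.cov(s, rowvar=False)) for s in subset_samples]
    return sum(log_zs) + log_gaussian_product_integral(mus, covs)
```

As a sanity check, with a single subset (S = 1) the correction term reduces to log ∫ q_1(θ) dθ = 0, so the combined estimate recovers the single-machine log evidence; with several subsets, only the per-subset means, covariances, and log evidences need to be communicated, consistent with the limited-communication setting described above.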