In recent years, large-scale Bayesian learning has attracted a great deal of attention. In the big-data era, however, the amount of data we face grows much faster than our ability to process it. Fortunately, large-scale datasets have been observed to possess rich internal structure and to be somewhat redundant. In this paper, we attempt to simplify the Bayesian posterior by exploiting this structure. Specifically, we restrict our attention to so-called well-clustered datasets and construct an \emph{approximate posterior} from their clustering information, which can be obtained efficiently by a suitable clustering algorithm. To construct the approximate posterior, we replace all data points in a cluster with the cluster's centroid, which significantly simplifies the posterior. Theoretically, we show that under certain conditions the approximate posterior is close, as measured by KL divergence, to the exact posterior. Furthermore, thorough experiments validate that the constructed posterior is a good approximation to the true posterior and is much easier to sample from.
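As a minimal sketch of the construction just described (the notation $x_i$, $c_k$, $n_k$ is introduced here for illustration and is not taken from the abstract): for data $\mathcal{D} = \{x_1, \dots, x_N\}$ partitioned into $K$ clusters with centroids $c_k$ and cluster sizes $n_k$, the exact posterior and the centroid-based approximation take the form
\[
p(\theta \mid \mathcal{D}) \;\propto\; p(\theta) \prod_{i=1}^{N} p(x_i \mid \theta),
\qquad
\tilde{p}(\theta \mid \mathcal{D}) \;\propto\; p(\theta) \prod_{k=1}^{K} p(c_k \mid \theta)^{\,n_k},
\]
so each likelihood evaluation requires only $K \ll N$ distinct terms, and the theoretical result bounds the KL divergence between $\tilde{p}$ and $p$ under the well-clustered assumption.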