Large Scale Clustering with Variational EM for Gaussian Mixture Models

This paper is a preliminary (pre-review) version of a sublinear variational algorithm for isotropic Gaussian mixture models (GMMs). Further developments of the algorithm for GMMs with diagonal covariance matrices (instead of isotropic clusters), together with the corresponding benchmarking results, have been published in TPAMI (doi:10.1109/TPAMI.2021.3133763) in the paper "A Variational EM Acceleration for Efficient Clustering at Very Large Scales". We kindly refer the reader to the TPAMI paper instead of this much earlier arXiv version (the TPAMI paper is also open access). Publicly available source code accompanies the paper (see https://github.com/variational-sublinear-clustering). Please note that the TPAMI paper no longer contains the benchmark on the 80 Million Tiny Images dataset because we followed the call of the dataset creators to discontinue the use of that dataset. The aim of the project (which resulted in this arXiv version and the later TPAMI paper) is to explore the current efficiency and large-scale limits of fitting a parametric model for clustering to data distributions. To reduce computational complexity, we used a clustering objective based on truncated variational EM (which reduces complexity for many clusters) in combination with coreset objectives (which reduce complexity for many data points). We used efficient coreset construction and efficient seeding to translate the theoretical sublinear complexity gains into an efficient algorithm. In applications to standard large-scale benchmarks for clustering, we then observed substantial wall-clock speedups compared to already highly efficient clustering approaches. To demonstrate that the observed efficiency enables applications previously considered infeasible, we clustered the entire, unscaled 80 Million Tiny Images dataset into up to 32,000 clusters.
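The combination of truncated posteriors and coreset weights can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes an isotropic GMM with equal mixing proportions and a fixed shared variance, and it takes the per-point candidate cluster sets as a given input, since the abstract does not describe how they are chosen. The function names `truncated_e_step` and `weighted_mean_update` and all parameter names are hypothetical.

```python
import numpy as np


def truncated_e_step(X, means, sigma2, candidates):
    """Responsibilities restricted to each point's candidate clusters.

    X: (N, D) float data (for example, coreset points)
    means: (C, D) float cluster means
    sigma2: shared isotropic variance (assumed fixed in this sketch)
    candidates: (N, K) integer cluster indices considered per point
    Returns an (N, K) array whose rows sum to 1.
    """
    diffs = X[:, None, :] - means[candidates]   # (N, K, D)
    sq_dist = np.sum(diffs ** 2, axis=2)        # (N, K)
    # With equal priors and a shared variance, all other terms cancel
    # in the normalization over the candidate set.
    log_p = -0.5 * sq_dist / sigma2
    log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
    resp = np.exp(log_p)
    return resp / resp.sum(axis=1, keepdims=True)


def weighted_mean_update(X, weights, resp, candidates, means):
    """Mean update in which each point counts with its (coreset) weight."""
    w_resp = weights[:, None] * resp            # (N, K)
    mass = np.zeros(means.shape[0])
    sums = np.zeros_like(means, dtype=float)
    # Scatter-add the weighted contributions into the candidate clusters.
    np.add.at(mass, candidates, w_resp)
    np.add.at(sums, candidates, w_resp[:, :, None] * X[:, None, :])
    new_means = means.astype(float).copy()
    seen = mass > 0                             # clusters without mass keep their mean
    new_means[seen] = sums[seen] / mass[seen, None]
    return new_means
```

In this sketch each point evaluates only K candidate clusters rather than all C, so one E-step costs on the order of N·K·D operations instead of N·C·D, and running it on a weighted subset of the data reduces N as well.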
