Background: Gaussian mixture modeling is a fundamental tool in clustering, as well as in discriminant analysis and semiparametric density estimation. However, estimating the optimal model for any given number of components is an NP-hard problem, and estimating the number of components is in some respects an even harder problem. Findings: In R, a popular package called mclust addresses both of these problems. However, Python has lacked such a package. We therefore introduce AutoGMM, a Python algorithm for automatic Gaussian mixture modeling, and its hierarchical version, HGMM. AutoGMM builds upon scikit-learn's AgglomerativeClustering and GaussianMixture classes, with certain modifications to make the results more stable. Empirically, on several different applications, AutoGMM performs approximately as well as mclust, and sometimes better. Conclusions: AutoGMM, a freely available Python package, enables efficient Gaussian mixture modeling by automatically selecting the initialization, number of clusters, and covariance constraints.
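The kind of search described above can be illustrated with a minimal sketch built directly on scikit-learn. This is not AutoGMM's actual API or implementation: the function name `select_gmm`, the candidate linkages, the use of BIC as the selection criterion, and the toy data are all assumptions made for illustration. The sketch initializes each GaussianMixture from an AgglomerativeClustering partition and keeps the candidate with the lowest BIC across initializations, numbers of components, and covariance constraints.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture


def select_gmm(X, max_components=5,
               covariance_types=("full", "tied", "diag", "spherical"),
               linkages=("ward", "complete"), random_state=0):
    """Hypothetical helper: grid search over initialization, number of
    components, and covariance constraint, scored by BIC (an assumption)."""
    best_model, best_bic, best_config = None, np.inf, None
    n_samples = X.shape[0]
    for k in range(1, max_components + 1):
        for linkage in linkages:
            # Agglomerative partition used only to initialize EM.
            if k == 1:
                labels = np.zeros(n_samples, dtype=int)
            else:
                labels = AgglomerativeClustering(
                    n_clusters=k, linkage=linkage).fit_predict(X)
            # Initial means and mixing weights derived from the partition.
            means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            weights = np.bincount(labels, minlength=k) / n_samples
            for cov in covariance_types:
                gmm = GaussianMixture(
                    n_components=k, covariance_type=cov,
                    means_init=means, weights_init=weights,
                    random_state=random_state).fit(X)
                bic = gmm.bic(X)  # lower is better in scikit-learn
                if bic < best_bic:
                    best_model, best_bic = gmm, bic
                    best_config = {"n_components": k, "linkage": linkage,
                                   "covariance_type": cov}
    return best_model, best_config


# Toy data (assumed for illustration): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

model, config = select_gmm(X)
labels = model.predict(X)
```

The returned `config` dictionary records which linkage, number of components, and covariance type won the search, so the three automatically selected choices named in the abstract are visible to the caller.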