Determining the number of clusters in a dataset is a fundamental issue in data clustering. Many methods have been proposed to solve the problem of selecting the number of clusters, considering it to be a problem with regard to model selection. This paper proposes an efficient algorithm that automatically selects a suitable number of clusters based on a probability distribution model framework. The algorithm includes the following two components. First, a generalization of Kohonen's self-organizing map (SOM) is introduced. In Kohonen's SOM, clusters are modeled as mean vectors. In the generalized SOM, each cluster is modeled as a probabilistic distribution and constructed by samples classified based on the likelihood. Second, the dynamically updating method of the SOM structure is introduced. In Kohonen's SOM, each cluster is tied to a node of a fixed two-dimensional lattice space and learned using neighborhood relations between nodes based on Euclidean distance. The extended SOM defines a graph with clusters as vertices and neighborhood relations as links and updates the graph structure by cutting weakly-connection and unnecessary vertex deletions. The weakness of a link is measured using the Kullback--Leibler divergence, and the redundancy of a vertex is measured using the minimum description length. Those extensions make it efficient to determine the appropriate number of clusters. Compared with existing methods, the proposed method is computationally efficient and can accurately select the number of clusters.
翻译:在数据组中确定组群数目是数据组的一个根本性问题。 已经提出了许多方法来解决选择组群数目的问题, 认为这在模式选择上是一个问题。 本文提出一个高效算法, 根据概率分布模型框架自动选择适当数目的组群。 算法包括以下两个组成部分。 首先, 将Kohoonen 的自组织地图( SOM) 加以概括化。 在 Kohoonen 的 SOM 中, 组群以平均矢量为模型。 在通用的 SOM 中, 每个组群以概率分布为模型, 由根据可能性分类的样本组成。 其次, 引入动态更新 SOM 结构的方法。 在 Kohokoonen 的 SOM 模式中, 每个组群集都与固定的二维拉特空间的节点连接起来, 并学习使用基于 Eucloidean 距离的节点之间的邻系关系。 扩展的SOM 将组群集作为平均数和邻系关系作为中继链接, 更新图表结构, 方法是通过减少薄弱的连接和不必要的垂直垂直垂直垂直的垂直递增 。 将 。 将最小的链接 。 将 与最小的链接 确定为最小的链接 。 。 。 和最小的链接 与最小的链接 与最小的链接 。 与最小的 。