Determining the number of clusters in a dataset is a fundamental issue in data clustering. Many methods have been proposed to solve the problem of selecting the number of clusters, considering it to be a problem with regard to model selection. This paper proposes a greedy algorithm that automatically selects a suitable number of clusters based on a probability distribution model framework. The algorithm includes two components. First, a generalization of Kohonen's self-organizing map (SOM), which has nodes linked to a probability distribution model, and which enables the algorithm to search for the winner based on the likelihood of each node, is introduced. Second, the proposed method uses a graph structure and a neighbor defined by the length of the shortest path between nodes, in contrast to Kohonen's SOM in which the nodes are fixed in the Euclidean space. This implementation makes it possible to update its graph structure by cutting links to weakly connected nodes to avoid unnecessary node deletion. The weakness of a node connection is measured using the Kullback--Leibler divergence and the redundancy of a node is measured by the minimum description length (MDL). This updating step makes it easy to determine the suitable number of clusters. Compared with existing methods, our proposed method is computationally efficient and can accurately select the number of clusters and perform clustering.
翻译:确定数据集中的组群数是数据分组的一个基本问题。 已经提出了许多方法来解决选择组群数的问题, 认为这在模式选择方面是一个问题 。 本文建议了一种贪婪的算法, 根据概率分布模型框架自动选择适当数量的组群。 算法包括两个组成部分 。 首先, 将Kohoonen 的自组织地图( SOM) 进行概括化, 该图与概率分布模型有节点联系, 使算法能够根据每个节点的可能性搜索优胜者 。 其次, 提议的方法使用图表结构以及以节点之间最短路径长度为定义的相邻点。 与Kohoonen 的 SOM 相比, 节点在 Euclidean 空间中固定了适当的组群。 实施此算法可以通过切换连接薄弱的节点来更新其图形结构, 以避免不必要的节点删除。 节点连接的弱点是通过 Kullack- Leiper 差异和节点的冗余度由最小描述( MDL) 以最小长度来测量。 这个更新步骤可以确定, 与现有组群集的精确的计算方法可以确定。 。