Spherical k-Means is frequently used to cluster document collections because it performs reasonably well in many settings and is computationally efficient. However, the time complexity increases linearly with the number of clusters k, which limits the suitability of the algorithm for larger values of k depending on the size of the collection. Optimizations targeted at the Euclidean k-Means algorithm largely do not apply because the cosine distance is not a metric. We therefore propose an efficient indexing structure to improve the scalability of Spherical k-Means with respect to k. Our approach exploits the sparsity of the input vectors and the convergence behavior of k-Means to reduce the number of comparisons on each iteration significantly.
翻译:球形 k- Means 通常用于集束文件收藏, 因为它在许多设置中表现得相当好,而且具有计算效率。 但是,随着集束 k 的数量, 时间复杂性会直线增加, 从而限制了算法对较大 k 值的适合性, 取决于收集的大小。 以 Euclidean k- Means 算法为对象的优化应用在很大程度上并不适用, 因为焦线距离不是一个尺度。 因此, 我们提出一个高效的索引结构, 以提高 球形 k- Means 相对于 k 的可缩缩缩性。 我们的方法利用了输入矢量的宽度和 k- Means 的趋同行为, 以显著减少每次迭代的比较次数 。