This paper presents SIVF, an architecture-friendly k-means clustering algorithm for large-scale, high-dimensional sparse data sets. The time efficiency of an algorithm is often measured by the number of costly operations it performs, such as similarity calculations. In practice, however, speed depends greatly on how well the algorithm adapts to the architecture of the computer system on which it runs. SIVF employs an invariant centroid-pair based filter (ICP) to decrease the number of similarity calculations between a data object and the centroids of all clusters. To maximize ICP performance, SIVF uses for the centroid set an inverted file structured to reduce pipeline hazards. Experiments on real large-scale document data sets demonstrate that SIVF runs faster and consumes less memory than existing algorithms. Our performance analysis reveals that SIVF achieves its higher speed by suppressing two performance-degradation factors, the numbers of cache misses and branch mispredictions, rather than by performing fewer similarity calculations.
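To illustrate the idea of an inverted file over the centroid set, the following is a minimal sketch of a single assignment step for sparse data. It is not the SIVF implementation: the ICP filter and the hazard-reducing layout are omitted, and the function names `build_centroid_inverted_file` and `assign`, the dictionary-based sparse vectors, and the dot-product similarity are all assumptions made for this example.

```python
from collections import defaultdict


def build_centroid_inverted_file(centroids):
    """Map each term id to the (centroid id, weight) pairs that contain it.

    `centroids` is a list of sparse vectors, each a dict {term_id: weight}.
    (Hypothetical representation chosen for this sketch.)
    """
    inverted = defaultdict(list)
    for cid, centroid in enumerate(centroids):
        for term, weight in centroid.items():
            if weight != 0.0:
                inverted[term].append((cid, weight))
    return inverted


def assign(obj, inverted, num_centroids):
    """Return (best centroid id, similarities) for one sparse object.

    Only terms that occur in the object are visited, so the cost scales
    with the lengths of the touched posting lists rather than with the
    full dimensionality.
    """
    sims = [0.0] * num_centroids
    for term, value in obj.items():
        # .get avoids creating empty posting lists for unseen terms.
        for cid, weight in inverted.get(term, ()):
            sims[cid] += value * weight
    best = max(range(num_centroids), key=sims.__getitem__)
    return best, sims


# Toy data: two centroids and one object over three terms.
centroids = [{0: 1.0}, {1: 0.6, 2: 0.8}]
inverted = build_centroid_inverted_file(centroids)
best, sims = assign({1: 0.5, 2: 0.5}, inverted, len(centroids))
```

In this sketch the inner loop walks contiguous posting lists and accumulates into a small array without per-centroid conditionals, which is the kind of access pattern the abstract associates with fewer cache misses and branch mispredictions. How SIVF actually structures its inverted file is not specified here.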