Measuring the similarity between two objects is the core operation that existing clustering algorithms use to group similar objects into clusters. This paper introduces a new similarity measure, called the point-set kernel, which computes the similarity between an object and a set of objects. The proposed clustering procedure uses this measure to characterize every cluster grown from a seed object. We show that the new procedure is both effective and efficient, which enables it to deal with large-scale datasets. In contrast, existing clustering algorithms are either efficient or effective, but not both. In comparison with state-of-the-art density-peak clustering and scalable kernel k-means clustering, we show that the proposed algorithm is more effective and runs orders of magnitude faster when applied to datasets of millions of data points on a commonly used computing machine.
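To make the idea of a similarity between an object and a set of objects concrete, the following is a minimal sketch and not the paper's actual formulation. It assumes that the point-to-set similarity is the average of a Gaussian point-to-point kernel over the set, and that a cluster grows from a seed by greedily admitting points whose similarity to the current cluster meets a threshold. The names `gaussian_kernel`, `point_set_similarity`, `grow_cluster`, and the parameters `bandwidth` and `threshold` are hypothetical and chosen only for illustration.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Point-to-point Gaussian kernel (an assumed base kernel, not the paper's)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def point_set_similarity(x, points, bandwidth=1.0):
    """Similarity between one point and a set: the mean kernel value (assumption)."""
    if not points:
        raise ValueError("the set of points must not be empty")
    total = sum(gaussian_kernel(x, y, bandwidth) for y in points)
    return total / len(points)


def grow_cluster(seed_index, data, threshold, bandwidth=1.0):
    """Grow one cluster from a seed (hypothetical greedy rule for illustration).

    In each pass, every unassigned point whose similarity to the current
    cluster is at least `threshold` joins the cluster. Growth stops when a
    pass adds no new point.
    """
    members = [seed_index]
    remaining = set(range(len(data))) - {seed_index}
    changed = True
    while changed:
        changed = False
        # The cluster is frozen for the duration of one pass.
        cluster_points = [data[i] for i in members]
        for i in sorted(remaining):
            if point_set_similarity(data[i], cluster_points, bandwidth) >= threshold:
                members.append(i)
                remaining.discard(i)
                changed = True
    return sorted(members)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
    print(grow_cluster(0, data, threshold=0.5))
```

In this naive form, each similarity evaluation costs time proportional to the size of the set, so the sketch illustrates only the shape of the computation and says nothing about the efficiency claimed for the proposed algorithm.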