Clustering multi-dimensional points is a fundamental task in many fields, and density-based clustering supports many applications because it can discover clusters of arbitrary shapes. This paper addresses the problem of Density-Peaks Clustering (DPC), a recently proposed density-based clustering framework. Although DPC already has many applications, its straightforward implementation takes time quadratic in the number of points in a given dataset and therefore does not scale to large datasets. To enable DPC on large datasets, we propose efficient algorithms for DPC. Specifically, we propose an exact algorithm, Ex-DPC, and two approximation algorithms, Approx-DPC and S-Approx-DPC. Under a reasonable assumption about a DPC parameter, our algorithms are sub-quadratic, i.e., they break the quadratic barrier. Moreover, Approx-DPC does not require any additional parameters and can return the same cluster centers as Ex-DPC, which yields an accurate clustering result. S-Approx-DPC requires an approximation parameter but runs faster in exchange. We further show that all three algorithms can be accelerated by leveraging multicore processing. We conduct extensive experiments using synthetic and real datasets, and our experimental results demonstrate that our algorithms are efficient, scalable, and accurate.
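To illustrate where the quadratic cost of the straightforward implementation comes from, the following is a minimal sketch of a naive DPC baseline. It is not Ex-DPC or either approximation algorithm. The abstract does not define DPC's quantities, so the sketch assumes the commonly used formulation: the local density of a point is the number of other points within a cutoff distance, and its dependent distance is the distance to the nearest point of higher density. The names `naive_dpc`, `d_cut`, `rho_min`, and `delta_min` are hypothetical and chosen for illustration only.

```python
import math


def naive_dpc(points, d_cut, rho_min, delta_min):
    """Naive quadratic-time DPC sketch (assumed standard formulation).

    points    : list of equal-length coordinate tuples
    d_cut     : cutoff distance used for the local density
    rho_min   : points with density below this are treated as noise
    delta_min : non-noise points with dependent distance >= this are centers
    Returns (rho, delta, labels); noise points get label -1.
    """
    n = len(points)

    # All pairwise distances: this is the O(n^2) step that limits scalability.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points closer than d_cut.
    rho = [
        sum(1 for j in range(n) if j != i and dist[i][j] < d_cut)
        for i in range(n)
    ]

    # Rank points from densest to sparsest; ties are broken by index
    # (an assumption of this sketch, so that "denser" is a strict order).
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    # Dependent distance: distance to the nearest point ranked denser.
    # The densest point has no such neighbor and keeps delta = infinity.
    delta = [math.inf] * n
    parent = [-1] * n
    for pos, i in enumerate(order):
        for j in order[:pos]:
            if dist[i][j] < delta[i]:
                delta[i], parent[i] = dist[i][j], j

    # Label assignment in density order, so a parent is labeled before its child.
    labels = [-1] * n
    next_label = 0
    for i in order:
        if rho[i] < rho_min:
            continue  # noise
        if delta[i] >= delta_min:
            labels[i] = next_label  # cluster center
            next_label += 1
        else:
            labels[i] = labels[parent[i]]  # inherit from nearest denser point
    return rho, delta, labels


# Two well-separated groups of five points each.
demo_points = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
    (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
]
demo_rho, demo_delta, demo_labels = naive_dpc(
    demo_points, d_cut=1.1, rho_min=1, delta_min=5.0
)
print(demo_labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Both the density computation and the search for the nearest denser point scan all pairs of points, which is the quadratic behavior the sub-quadratic algorithms are designed to avoid.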