DBSCAN is a fundamental spatial clustering algorithm with numerous practical applications. However, a bottleneck of the algorithm is in the worst case, the run time complexity is $O(n^2)$. To address this limitation, we propose a new grid-based algorithm for exact DBSCAN in Euclidean space called GriT-DBSCAN, which is based on the following two techniques. First, we introduce a grid tree to organize the non-empty grids for the purpose of efficient non-empty neighboring grids queries. Second, by utilising the spatial relationships among points, we propose a technique that iteratively prunes unnecessary distance calculations when determining whether the minimum distance between two sets is less than or equal to a certain threshold. We theoretically prove that the complexity of GriT-DBSCAN is linear to the data set size. In addition, we obtain two variants of GriT-DBSCAN by incorporating heuristics, or by combining the second technique with an existing algorithm. Experiments are conducted on both synthetic and real-world data sets to evaluate the efficiency of GriT-DBSCAN and its variants. The results of our analyses show that our algorithms outperform existing algorithms.
翻译:DBSCAN是一种基础空间集群算法,有许多实际应用。 但是, 算法的瓶颈是最糟糕的情况, 运行时间复杂程度为$O (n)2美元。 为解决这一限制, 我们提议在Euclidean 空间为叫做 GriT- DBSCAN 的精确 DBSCAN 使用一种新的基于网格的算法, 其依据是以下两种技术。 首先, 我们引入了一条网格树来组织非空网格, 以便高效的非空邻域格查询。 第二, 通过利用各点之间的空间关系, 我们建议一种技术, 在确定两组之间最小距离是否低于或等于某一阈值时, 迭代地进行不必要的距离计算。 我们理论上证明 GriT- DBSCAN 的复杂性与数据集大小是线性。 此外, 我们通过吸收超光学或将第二种技术与现有算法相结合, 将合成和现实世界数据集进行实验, 以评价GriT- DBBSCAN 的算法分析结果显示我们现有的变式。