Density-based clustering is a widely used tool in data science, and many data science applications now rely on high-dimensional neural embeddings. However, traditional density-based clustering techniques such as DBSCAN suffer degraded performance on high-dimensional data. In this paper, we propose LAF, a generic learned accelerator framework that speeds up the original DBSCAN and its sampling-based variants on high-dimensional data under the angular distance metric. The framework consists of a learned cardinality estimator and a post-processing module. The cardinality estimator quickly predicts whether a data point is a core point, so that unnecessary range queries can be skipped, while the post-processing module detects false negative predictions and merges falsely separated clusters. Our evaluation shows that LAF-enhanced DBSCAN outperforms state-of-the-art efficient DBSCAN variants in both efficiency and clustering quality.
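The interplay between the two modules described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the `estimate` callable is a hypothetical placeholder for the learned cardinality estimator, the cosine distance 1 - cos θ on unit vectors stands in for the angular distance, and the false-negative check (a skipped point that appears in enough range-query results must be a core point, because the distance is symmetric) is one plausible way to realize the post-processing step, not necessarily the one used in LAF.

```python
import numpy as np

NOISE = -1

def range_query(X, i, eps):
    """Exact range query on unit vectors: all j with 1 - <x_i, x_j> <= eps (includes i)."""
    return np.flatnonzero(1.0 - X @ X[i] <= eps).tolist()

def find(parent, a):
    # Union-find lookup with path halving.
    while parent[a] != a:
        parent[a] = parent[parent[a]]
        a = parent[a]
    return a

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)

def accelerated_dbscan(X, eps, min_pts, estimate, post_process=True):
    """DBSCAN that runs a range query only for points the estimator predicts to be core.

    `estimate(i)` is a hypothetical stand-in for a learned cardinality estimator:
    it returns a predicted neighbor count for point i.
    """
    n = len(X)
    parent = list(range(n))
    is_core = np.zeros(n, dtype=bool)
    queried = np.zeros(n, dtype=bool)
    seen_by = [[] for _ in range(n)]  # queried points whose result contained this point

    for i in range(n):
        if estimate(i) < min_pts:
            continue  # predicted non-core: skip the range query
        queried[i] = True
        nbrs = range_query(X, i, eps)
        for j in nbrs:
            if j != i:
                seen_by[j].append(i)
        if len(nbrs) >= min_pts:  # verified core point
            is_core[i] = True
            for j in nbrs:
                if is_core[j]:
                    union(parent, i, j)

    if post_process:
        # A skipped point seen by k queried points has at least k + 1 neighbors
        # (counting itself), so it is a missed core point if k + 1 >= min_pts.
        for p in range(n):
            if not queried[p] and len(seen_by[p]) + 1 >= min_pts:
                is_core[p] = True
                for q in seen_by[p]:
                    if is_core[q]:
                        union(parent, p, q)  # merges clusters that p bridges

    labels = np.full(n, NOISE)
    cluster_id = {}
    for i in np.flatnonzero(is_core).tolist():
        root = find(parent, i)
        labels[i] = cluster_id.setdefault(root, len(cluster_id))
    for p in range(n):
        if not is_core[p]:
            cores = [q for q in seen_by[p] if is_core[q]]
            if cores:
                labels[p] = labels[cores[0]]  # border point joins a neighboring cluster
    return labels

# Five unit vectors on an arc; each is within eps of its direct neighbors only.
angles = 0.1 * np.arange(5)
X = np.stack([np.cos(angles), np.sin(angles)], axis=1)
eps, min_pts = 0.01, 3

exact = lambda i: len(range_query(X, i, eps))  # oracle estimator
faulty = lambda i: 0 if i == 2 else exact(i)   # false negative on the bridge point

with_pp = accelerated_dbscan(X, eps, min_pts, faulty)
without_pp = accelerated_dbscan(X, eps, min_pts, faulty, post_process=False)
print(with_pp.tolist())     # [0, 0, 0, 0, 0]
print(without_pp.tolist())  # [0, 0, 0, 1, 1]
```

In this toy example the estimator wrongly predicts that point 2 is not a core point, which splits the chain into two clusters. The post-processing pass recovers point 2 from the range-query results of its neighbors and merges the two clusters again. The check only gives a lower bound on the neighbor count, so a sketch like this would catch some false negatives but not necessarily all of them.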