Density-based clustering algorithms are widely used to discover clusters in pattern recognition and machine learning because they can handle non-hyperspherical clusters and are robust to outliers. However, the runtime of density-based algorithms is dominated by fixed-radius near-neighbor searches and density calculations, both of which are time-consuming. Moreover, traditional acceleration methods based on indexing techniques such as the KD-tree are not effective on high-dimensional data. In this paper, we propose a fast region query algorithm, fast principal component analysis pruning (FPCAP), which combines the fast principal component analysis technique with geometric information provided by the principal attributes of the data. FPCAP can process high-dimensional data and can be easily applied to density-based methods to prune unnecessary distance calculations when finding neighbors and estimating densities. As an application to density-based clustering, we combine FPCAP with the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The resulting improved DBSCAN (IDBSCAN) preserves the advantages of DBSCAN while greatly reducing redundant distance computations. Experiments on seven benchmark datasets demonstrate that the proposed algorithm significantly improves computational efficiency.
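To make the pruning idea concrete, the following is a minimal sketch of a projection-based region query. It is not the FPCAP procedure itself, whose details the abstract does not give. It relies only on the standard fact that projecting onto orthonormal principal directions never increases Euclidean distance, so a projected distance greater than the radius rules a point out without computing the full distance. The function names `fit_projection` and `region_query`, the parameter `k`, and the use of a full SVD in place of a fast PCA routine are all assumptions made for illustration.

```python
import numpy as np

def fit_projection(X, k=2):
    """Project centered data onto its top-k principal directions.

    Assumption: a plain SVD stands in for a fast PCA routine.
    """
    Xc = X - X.mean(axis=0)
    # Rows of Vt are orthonormal principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:k].T
    order = np.argsort(Z[:, 0])          # points sorted by first component
    return Z, order, Z[order, 0]

def region_query(X, Z, order, z1_sorted, i, eps):
    """Return sorted indices of all points within eps of X[i]."""
    # Small slack guards the pruning bounds against rounding error.
    tol = eps * (1.0 + 1e-9)
    # Step 1: binary search on the first principal coordinate.
    lo = np.searchsorted(z1_sorted, Z[i, 0] - tol, side="left")
    hi = np.searchsorted(z1_sorted, Z[i, 0] + tol, side="right")
    cand = order[lo:hi]
    # Step 2: the k-dimensional projected distance is a lower bound
    # on the true distance, so anything beyond tol can be discarded.
    cand = cand[np.linalg.norm(Z[cand] - Z[i], axis=1) <= tol]
    # Step 3: exact distances are computed only for the survivors.
    d = np.linalg.norm(X[cand] - X[i], axis=1)
    return np.sort(cand[d <= eps])
```

Because the projected distance only ever underestimates the true one, the query returns the same neighbor set as a brute-force scan. In a DBSCAN-style loop, `region_query` would replace the exhaustive neighborhood search for each point.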