Clustering large, mixed data is a central problem in data mining. Many approaches adopt the idea of k-means and hence are sensitive to initialisation, detect only spherical clusters, and require the number of clusters, which is unknown, to be specified a priori. We develop a new clustering algorithm for large data of mixed type, aiming to improve the applicability and efficiency of the peak-finding technique. The improvements are threefold: (1) the new algorithm is applicable to mixed data; (2) it can detect outliers and clusters of relatively lower density; (3) it can determine the correct number of clusters. The computational complexity is greatly reduced by applying a fast k-nearest neighbors method and by scaling down to component sets. We present experimental results to verify that the algorithm works well in practice.
Keywords: Clustering; Big Data; Mixed Attribute; Density Peaks; Nearest-Neighbor Graph; Conductance.
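To make the peak-finding idea concrete, the following is a minimal sketch of generic density-peak clustering on mixed data, not the algorithm developed in this paper. It assumes a Gower-style distance (range-scaled differences on numeric attributes, mismatches on categorical ones), a k-nearest-neighbour density estimate, and a user-supplied number of clusters; the function names `mixed_distance` and `density_peaks` are hypothetical. It uses a brute-force distance matrix rather than a fast k-nearest neighbors method, and it does not include outlier detection, automatic selection of the number of clusters, or the conductance-based steps suggested by the keywords.

```python
import numpy as np

def mixed_distance(num, cat):
    """Gower-style distance matrix for mixed data (an assumed choice).

    Numeric columns contribute range-scaled absolute differences,
    categorical columns contribute 0/1 mismatches; the sum is averaged
    over all columns.
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    span = num.max(axis=0) - num.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero on constant columns
    d_num = (np.abs(num[:, None, :] - num[None, :, :]) / span).sum(axis=2)
    d_cat = (cat[:, None, :] != cat[None, :, :]).sum(axis=2)
    return (d_num + d_cat) / (num.shape[1] + cat.shape[1])

def density_peaks(dist, k, n_clusters):
    """Generic density-peak clustering on a precomputed distance matrix."""
    n = dist.shape[0]
    # kNN density: inverse mean distance to the k nearest neighbours
    # (column 0 of each sorted row is the point itself).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = 1.0 / (knn.mean(axis=1) + 1e-12)

    order = np.argsort(-rho, kind="stable")  # points by decreasing density
    delta = np.zeros(n)                      # distance to nearest denser point
    parent = np.full(n, -1)                  # index of that denser point
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j

    # Peaks are points with both high density and high separation.
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the densest point is always a peak
    peaks = np.argsort(-gamma)[:n_clusters]

    labels = np.full(n, -1)
    labels[peaks] = np.arange(n_clusters)
    # Visiting points by decreasing density guarantees that each parent
    # is labelled before its children.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, peaks
```

For example, with `dist = mixed_distance(num, cat)` computed from a numeric array `num` and a categorical array `cat` with one row per object, `density_peaks(dist, k=2, n_clusters=2)` returns a label per object and the indices of the chosen peaks. The quadratic distance matrix limits this sketch to small data sets.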