Clustering Large Data Sets with Incremental Estimation of Low-density Separating Hyperplanes

An efficient method for obtaining low-density hyperplane separators in the unsupervised context is proposed. Low-density separators can be used to obtain a partition of a set of data based on their allocations to the different sides of the separators. The proposed method is based on applying stochastic gradient descent to the integrated density on the hyperplane with respect to a convolution of the underlying distribution and a smoothing kernel. In the case where the bandwidth of the smoothing kernel is decreased towards zero, the bias of these updates with respect to the true underlying density tends to zero, and convergence to a minimiser of the density on the hyperplane can be obtained. A post-processing of the partition induced by a collection of low-density hyperplanes yields an efficient and accurate clustering method which is capable of automatically selecting an appropriate number of clusters. Experiments with the proposed approach show that it is highly competitive in terms of both speed and accuracy when compared with relevant benchmarks. Code to implement the proposed approach is available in the form of an R package from https://github.com/DavidHofmeyr/iMDH.
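The core idea can be illustrated with a minimal Python sketch. It is not the iMDH package's implementation, and every function name, schedule, and default below is an assumption made for illustration. The sketch assumes an isotropic Gaussian smoothing kernel with bandwidth h, for which the smoothed density integrated over the hyperplane {x : v·x = b} with unit normal v equals the expectation of a one-dimensional Gaussian density evaluated at b − v·X. A single observation therefore gives an unbiased stochastic gradient of the smoothed objective. The sketch also assumes the data are roughly centred and scaled, and clips b to [−alpha, alpha] so the hyperplane cannot escape into the empty tails of the distribution.

```python
import numpy as np


def smoothed_density_grad(x, v, b, h):
    """Gradient of phi_h(b - v.x) with respect to (v, b) for one sample x,
    where phi_h is the N(0, h^2) density (Gaussian kernel assumed)."""
    r = b - v @ x
    phi = np.exp(-0.5 * (r / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)
    dphi_dr = -r / h**2 * phi
    # dr/dv = -x and dr/db = 1
    return -dphi_dr * x, dphi_dr


def sgd_low_density_hyperplane(X, v0, b0=0.0, h0=1.0, eta0=0.1,
                               gamma=0.2, alpha=1.0, seed=0):
    """One pass of projected SGD over X (hypothetical schedules and defaults)."""
    rng = np.random.default_rng(seed)
    v = np.asarray(v0, dtype=float)
    v = v / np.linalg.norm(v)
    b = float(b0)
    for t, i in enumerate(rng.permutation(len(X)), start=1):
        h = h0 * t ** (-gamma)       # bandwidth shrinks towards zero
        eta = eta0 / np.sqrt(t)      # decreasing step size
        grad_v, grad_b = smoothed_density_grad(X[i], v, b, h)
        v = v - eta * grad_v
        v = v / np.linalg.norm(v)    # project back onto the unit sphere
        # keep the hyperplane near the centre of the (assumed centred) data
        b = float(np.clip(b - eta * grad_b, -alpha, alpha))
    return v, b


def side_labels(X, v, b):
    """Binary partition: 1 if a point lies on the positive side, else 0."""
    return (X @ v > b).astype(int)
```

Calling `sgd_low_density_hyperplane(X, v0)` on an n-by-d array `X` returns a unit normal and an offset, and `side_labels(X, v, b)` then gives the two-way partition induced by that hyperplane. Each update touches a single observation, which is what makes this style of estimation attractive for large data sets.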
