Clustering is a common unsupervised machine learning technique for grouping data points according to similar features. We focus here on unsupervised clustering of contaminated data, i.e., the setting where K-medians should be preferred to K-means because of its robustness. More precisely, we concentrate on a common question in clustering: how to choose the number of clusters? The answer proposed here is to frame the choice of the optimal number of clusters as the minimization of a penalized risk function. In this paper, we obtain a suitable penalty shape for our criterion and derive an associated oracle-type inequality. Finally, the performance of this approach with different types of K-medians algorithms is compared in a simulation study with other popular techniques. All studied algorithms are available in the R package Kmedians on CRAN.
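To make the idea of penalized selection concrete, the following is a minimal Python sketch, not the implementation from the Kmedians package. It assumes a basic Lloyd-style K-medians in which each center is a geometric median computed by Weiszfeld iterations, and it uses a placeholder penalty `lam * sqrt(k / n)`. That penalty form, the constant `lam`, and all function names are assumptions made for illustration; the penalty shape actually derived in the paper is not stated in this abstract.

```python
import numpy as np

def geometric_median(points, n_iter=50, eps=1e-9):
    """Approximate the geometric median of the rows of `points` (Weiszfeld iterations)."""
    m = points.mean(axis=0)
    for _ in range(n_iter):
        # Distances are floored at eps to avoid division by zero.
        d = np.maximum(np.linalg.norm(points - m, axis=1), eps)
        w = 1.0 / d
        m = (points * w[:, None]).sum(axis=0) / w.sum()
    return m

def k_medians(X, k, n_iter=20, seed=0):
    """Lloyd-style K-medians; returns (centers, empirical risk)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster is empty
                centers[j] = geometric_median(X[labels == j])
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    # Risk: mean (unsquared) Euclidean distance to the nearest center.
    return centers, dist.min(axis=1).mean()

def select_k(X, k_max, lam, n_starts=5):
    """Pick k minimizing risk(k) + penalty(k); the penalty form is hypothetical."""
    n = len(X)
    crit = {}
    for k in range(1, k_max + 1):
        risk = min(k_medians(X, k, seed=s)[1] for s in range(n_starts))
        crit[k] = risk + lam * np.sqrt(k / n)  # placeholder penalty, not the paper's
    return min(crit, key=crit.get), crit
```

Calling `select_k(X, k_max=6, lam=1.0)` on an `(n, d)` array returns the selected number of clusters together with the criterion value for each candidate `k`. The value of `lam` would need calibration in practice, and the choice made here is arbitrary.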