The $k$-means method is one of the most important unsupervised learning techniques in statistics and computer science. Its goal is to partition a data set into clusters such that observations within a cluster are as homogeneous as possible and observations in different clusters are as heterogeneous as possible. Although the method is well known, the study of its asymptotic properties has lagged far behind, which makes it difficult to develop more precise $k$-means methods in practice. To address this issue, a new concept called clustering consistency is proposed. Fundamentally, clustering consistency is more appropriate for clustering methods than the previous notion of criterion consistency. Using this concept, a new $k$-means method is proposed. The proposed method is found to have lower clustering error rates and to be more robust to small clusters and outliers than existing $k$-means methods. When $k$ is unknown, the proposed method can also identify the number of clusters by using the Gap statistic. This is rarely achieved by the existing $k$-means methods adopted in many software packages.
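For readers who want a concrete picture of the baseline being discussed, the following is a minimal sketch of the standard (Lloyd-type) $k$-means iteration, which alternates between assigning each observation to its nearest center and recomputing each center as a cluster mean. It is not the method proposed here, and it does not implement clustering consistency or the Gap statistic. The function name `lloyd_kmeans`, its parameters, and the toy data are assumptions chosen only for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, initial_centers, max_iter=100):
    """Standard Lloyd iteration (illustrative sketch, not the proposed method).

    points          -- list of equal-length tuples of floats
    initial_centers -- list of k starting centers (hypothetical input;
                       real software usually picks these automatically)
    Returns (labels, centers, wcss), where wcss is the within-cluster
    sum of squares that k-means tries to make small.
    """
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest current center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wcss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss


# Toy data: two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centers, toy_wcss = lloyd_kmeans(
    toy_points, initial_centers=[(0.0, 0.0), (10.0, 10.0)]
)
print(toy_labels, toy_centers, toy_wcss)
```

On well-separated data like this toy set, the iteration recovers the two groups. The difficulties mentioned above (small clusters, outliers, unknown $k$) are the cases where this plain objective tends to be less reliable.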