Clustering is an important exploratory data analysis technique for grouping objects based on their similarity. The widely used $K$-means clustering method relies on some notion of distance to partition data into a smaller number of groups. In Euclidean space, the centroid-based and distance-based formulations of $K$-means are equivalent. In modern machine learning applications, data often arise as probability distributions, and a natural generalization for handling measure-valued data is to use the optimal transport metric. Due to the non-negative Alexandrov curvature of the Wasserstein space, barycenters suffer from regularity and non-robustness issues. The peculiar behaviors of Wasserstein barycenters may cause the centroid-based formulation to fail to represent the within-cluster data points, while the more direct distance-based $K$-means approach and its semidefinite program (SDP) relaxation are capable of recovering the true cluster labels. In the special case of clustering Gaussian distributions, we show that the SDP-relaxed Wasserstein $K$-means can achieve exact recovery provided that the clusters are well separated under the $2$-Wasserstein metric. Our simulation and real data examples also demonstrate that distance-based $K$-means can achieve better classification performance than the standard centroid-based $K$-means for clustering probability distributions and images.
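As a rough illustration of the distance-based formulation, the sketch below clusters a handful of one-dimensional Gaussians using only their pairwise squared $2$-Wasserstein distances, with no barycenter computed. It relies on the standard closed form $W_2^2 = (m_1 - m_2)^2 + (s_1 - s_2)^2$ for univariate Gaussians with means $m_i$ and standard deviations $s_i$. Everything else is an assumption made for illustration: the function names, the toy data, the $1/(2|G_k|)$ normalization of the within-cluster pairwise sum (chosen so that it matches the centroid cost in the Euclidean case), and the exhaustive search over labelings, which stands in for the SDP relaxation and is only feasible for very small samples.

```python
import itertools


def w2_squared_1d(g1, g2):
    """Squared 2-Wasserstein distance between two 1D Gaussians (mean, std)."""
    (m1, s1), (m2, s2) = g1, g2
    return (m1 - m2) ** 2 + (s1 - s2) ** 2


def distance_based_cost(labels, dist_sq, k):
    """Sum over clusters of the within-cluster pairwise squared distances,
    normalized by 2 * cluster size (assumed normalization)."""
    total = 0.0
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            # Disallow empty clusters so that exactly k groups are used.
            return float("inf")
        pair_sum = sum(dist_sq[i][j] for i in members for j in members)
        total += pair_sum / (2 * len(members))
    return total


def brute_force_kmeans(dist_sq, k):
    """Exhaustive minimization over all labelings (tiny inputs only).
    This is a stand-in for the SDP relaxation, not an implementation of it."""
    n = len(dist_sq)
    best = min(
        itertools.product(range(k), repeat=n),
        key=lambda labels: distance_based_cost(labels, dist_sq, k),
    )
    return list(best)


# Hypothetical toy data: two well-separated groups of (mean, std) pairs.
gaussians = [
    (0.0, 1.0), (0.2, 1.1), (-0.1, 0.9),
    (5.0, 2.0), (5.3, 2.2), (4.8, 1.9),
]
dist_sq = [[w2_squared_1d(a, b) for b in gaussians] for a in gaussians]
labels = brute_force_kmeans(dist_sq, k=2)
print(labels)
```

Because the cost depends on the data only through the distance matrix, the same two helper functions would apply unchanged to any other metric on distributions; only `w2_squared_1d` is specific to univariate Gaussians.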