Sum-of-norms clustering is a popular convexification of $K$-means clustering. We show that, if the dataset is made of a large number of independent random variables distributed according to the uniform measure on the union of two disjoint balls of unit radius, and if the balls are sufficiently close to one another, then sum-of-norms clustering will typically fail to recover the decomposition of the dataset into two clusters. As the dimension tends to infinity, this happens even when the distance between the centers of the two balls is taken to be as large as $2\sqrt{2}$. In order to show this, we introduce and analyze a continuous version of sum-of-norms clustering, where the dataset is replaced by a general measure. In particular, we state and prove a local-global characterization of the clustering that seems to be new even in the case of discrete datapoints.
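For concreteness, one common way to write the sum-of-norms objective for datapoints $x_1,\dots,x_N \in \mathbb{R}^d$ is sketched below. The normalization constants and the regularization parameter $\lambda$ are assumptions made for illustration and may differ from the conventions adopted in the paper.

\[
\min_{y_1,\dots,y_N \in \mathbb{R}^d} \; \frac{1}{N}\sum_{i=1}^{N} |y_i - x_i|^2 \;+\; \frac{\lambda}{N^2}\sum_{i,j=1}^{N} |y_i - y_j|, \qquad \lambda \ge 0.
\]

Two datapoints $x_i$ and $x_j$ are placed in the same cluster when the minimizer satisfies $y_i = y_j$. Under the same assumed normalization, a natural continuous analogue replaces the empirical measure of the datapoints by a general measure $\mu$ and minimizes over functions $u$:

\[
\int |u(x) - x|^2 \, d\mu(x) \;+\; \lambda \iint |u(x) - u(y)| \, d\mu(x)\, d\mu(y).
\]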