This paper considers a canonical clustering problem in which one receives unlabeled samples drawn from a balanced mixture of two elliptical distributions and aims to learn a classifier that estimates the labels. Many popular methods, including PCA and k-means, require the individual components of the mixture to be somewhat spherical and perform poorly when they are stretched. To overcome this issue, we propose a non-convex program that seeks an affine transform turning the data into a one-dimensional point cloud concentrating around $-1$ and $1$, after which clustering becomes easy. Our theoretical contributions are two-fold: (1) we show that the non-convex loss function exhibits desirable geometric properties when the sample size exceeds some constant multiple of the dimension, and (2) we leverage this to prove that an efficient first-order algorithm achieves near-optimal statistical precision without requiring a good initialization. We also propose a general clustering methodology that allows flexible choices of feature transforms and loss objectives.
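To make the idea concrete, the following is a minimal NumPy sketch of this style of approach: fit an affine map $x \mapsto \beta^\top x + b$ so that the projected data concentrate near $\pm 1$, then cluster by sign. The quartic penalty $\frac{1}{n}\sum_i \big((\beta^\top x_i + b)^2 - 1\big)^2$, the plain gradient-descent updates, and the random initialization are illustrative assumptions for this sketch, not the paper's exact program or algorithm.

```python
import numpy as np

def fit_affine_transform(X, lr=0.1, n_iters=500, seed=0):
    """Fit (beta, b) so that beta^T x_i + b concentrates near -1 and +1.

    Minimizes the (hypothetical) non-convex loss
        mean_i ((beta^T x_i + b)^2 - 1)^2
    by plain gradient descent from a random initialization.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=d) / np.sqrt(d)   # random, not carefully chosen, init
    b = 0.0
    for _ in range(n_iters):
        z = X @ beta + b                     # one-dimensional projection
        r = z ** 2 - 1.0                     # residual: zero when z hits +/-1
        grad_z = 4.0 * r * z / n             # d/dz of mean((z^2 - 1)^2)
        beta -= lr * (X.T @ grad_z)          # chain rule through z = X beta + b
        b -= lr * grad_z.sum()
    return beta, b

def cluster(X):
    """Estimate labels in {-1, +1} by thresholding the learned projection."""
    beta, b = fit_affine_transform(X)
    return np.sign(X @ beta + b)
```

For elongated (elliptical) components, the learned $\beta$ can down-weight the stretched directions, which is the intuition behind preferring an affine transform over a purely spherical criterion.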