Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems, such as clustering. Despite their high accuracy, semidefinite programs are often too slow in practice and scale poorly to large (or even moderately sized) datasets. In this paper, we introduce a linear-time algorithm for approximating the SDP relaxation of $K$-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. We show that the SL approach enjoys an exact recovery threshold similar to that of the $K$-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive, with enhanced theoretical properties, when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the proposed method is statistically more accurate than state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and that its accuracy is comparable to that of the original $K$-means SDP at a substantially reduced runtime.
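The two stages described above (cluster a subsample, then assign every point to its nearest centroid) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sketch_and_lift` and `lloyd_kmeans`, the subsample size `m`, and the uniform subsampling scheme are assumptions made for illustration. In particular, the subsample solver here is plain Lloyd iterations with farthest-point initialization, swapped in for the $K$-means SDP so that the example needs only NumPy; any routine that returns labels for the subsample could be passed as `solver` instead.

```python
import numpy as np


def lloyd_kmeans(X, K, n_iter=20, seed=0):
    """Stand-in subsample solver (NOT an SDP): Lloyd iterations
    with a greedy farthest-point initialization."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # Squared distance from each point to its closest chosen centroid.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):  # leave an empty cluster's centroid as is
                centroids[k] = X[labels == k].mean(axis=0)
    return labels


def sketch_and_lift(X, K, m, solver=lloyd_kmeans, seed=0):
    """Hypothetical two-stage sketch: cluster m subsampled points,
    then lift the solution to all n points by nearest-centroid rounding."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Sketch: uniform subsample without replacement (an assumption).
    idx = rng.choice(n, size=min(m, n), replace=False)
    X_sub = X[idx]
    sub_labels = solver(X_sub, K)

    # Centroids estimated from the subsample only.
    # Assumes every one of the K clusters is non-empty in the subsample.
    centroids = np.stack(
        [X_sub[sub_labels == k].mean(axis=0) for k in range(K)]
    )

    # Lift: one pass over all points, O(n * K * dim) work.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1), centroids


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    truth = np.repeat(np.arange(3), 1000)
    X = centers[truth] + rng.normal(size=(3000, 2))
    labels, centroids = sketch_and_lift(X, K=3, m=150)
    print(np.bincount(labels, minlength=3))
```

In this sketch the expensive step touches only the `m` subsampled points, while the lifting step is a single pass over the full dataset, which is where the linear dependence on the number of data points comes from.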