Likelihood adjusted semidefinite programs for clustering heterogeneous data

Clustering is a widely deployed unsupervised learning tool. Model-based clustering is a flexible framework for handling data heterogeneity when the clusters have different shapes. Likelihood-based inference for mixture distributions often involves non-convex and high-dimensional objective functions, which pose difficult computational and statistical challenges. The classic expectation-maximization (EM) algorithm is a computationally thrifty iterative method that, in each iteration, maximizes a surrogate function minorizing the log-likelihood of the observed data. However, it suffers from bad local maxima even in the special case of the standard Gaussian mixture model with common isotropic covariance matrices. On the other hand, recent studies reveal that the unique global solution of a semidefinite programming (SDP) relaxation of $K$-means achieves the information-theoretically sharp threshold for perfectly recovering the cluster labels under the standard Gaussian mixture model. In this paper, we extend the SDP approach to a general setting by integrating cluster labels as model parameters, and we propose an iterative likelihood adjusted SDP (iLA-SDP) method that directly maximizes the \emph{exact} observed likelihood in the presence of data heterogeneity. By lifting the cluster assignment to group-specific membership matrices, iLA-SDP avoids centroid estimation, a key feature that allows exact recovery when the centroids are well separated, without being trapped by their adversarial configurations. Thus iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data. Our numerical experiments demonstrate that iLA-SDP can achieve lower mis-clustering errors than several widely used clustering methods, including $K$-means, SDP, and EM algorithms.
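
To make the EM baseline mentioned in the abstract concrete, the following is a minimal sketch of EM for a one-dimensional, two-component Gaussian mixture with a common variance. It is not the iLA-SDP method and is not taken from the paper: the synthetic data, the starting means, and the helper names `log_likelihood`, `em_step`, and `run_em` are all assumptions made for illustration. The sketch records the observed-data log-likelihood after each iteration, which EM should never decrease.

```python
import math
import random


def log_likelihood(xs, weights, means, var):
    """Observed-data log-likelihood of a 1-D Gaussian mixture with common variance."""
    total = 0.0
    for x in xs:
        dens = sum(
            w * math.exp(-((x - m) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
            for w, m in zip(weights, means)
        )
        total += math.log(dens)
    return total


def em_step(xs, weights, means, var):
    """One EM iteration: E-step responsibilities, then closed-form M-step."""
    n, k = len(xs), len(means)
    resp = []
    for x in xs:
        # Unnormalized posteriors; the shared Gaussian constant cancels on normalizing.
        row = [w * math.exp(-((x - m) ** 2) / (2 * var)) for w, m in zip(weights, means)]
        s = sum(row)
        resp.append([r / s for r in row])
    # Effective number of points assigned to each component.
    nk = [sum(resp[i][j] for i in range(n)) for j in range(k)]
    new_weights = [c / n for c in nk]
    new_means = [sum(resp[i][j] * xs[i] for i in range(n)) / nk[j] for j in range(k)]
    # Common (pooled) variance shared by all components.
    new_var = sum(
        resp[i][j] * (xs[i] - new_means[j]) ** 2 for i in range(n) for j in range(k)
    ) / n
    return new_weights, new_means, new_var


def run_em(xs, means, n_iter=50):
    """Run EM from the given starting means (an illustrative initialization)."""
    weights = [1.0 / len(means)] * len(means)
    var = 1.0
    history = [log_likelihood(xs, weights, means, var)]
    for _ in range(n_iter):
        weights, means, var = em_step(xs, weights, means, var)
        history.append(log_likelihood(xs, weights, means, var))
    return weights, means, var, history


# Synthetic data (assumed for illustration): two well-separated clusters.
random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(100)]
data += [random.gauss(2.0, 1.0) for _ in range(100)]

weights, means, var, history = run_em(data, means=[-1.0, 1.0])
```

Re-running `run_em` with different starting means is a simple way to probe the sensitivity to initialization that the abstract attributes to EM. On a well-separated toy problem like this one, most starting points are likely to reach the same solution; the bad local maxima discussed above arise in harder configurations.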
