Minimum sum-of-squares clustering (MSSC), or k-means type clustering, is traditionally considered an unsupervised learning task. In recent years, the use of background knowledge to improve cluster quality and promote the interpretability of the clustering process has become a hot research topic at the intersection of mathematical optimization and machine learning. The problem of exploiting background information in data clustering is called semi-supervised or constrained clustering. In this paper, we present a new branch-and-bound algorithm for semi-supervised MSSC, where background knowledge is incorporated as pairwise must-link and cannot-link constraints. For the lower bound, we solve the semidefinite programming relaxation of the MSSC discrete optimization model and strengthen the bound with a cutting-plane procedure. For the upper bound, we propose an adaptation of the k-means algorithm to the constrained case that relies on integer programming tools. For the first time, the proposed global optimization algorithm efficiently solves real-world instances with up to 800 data points, with different combinations of must-link and cannot-link constraints and a generic number of features. This problem size is about four times larger than that of the instances solved by state-of-the-art exact algorithms.
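
To make the problem statement concrete, the following minimal sketch evaluates the MSSC objective of a given cluster assignment and checks whether that assignment respects a set of must-link and cannot-link pairs. It is an illustration only, not the branch-and-bound algorithm of the paper: the function names `mssc_objective` and `satisfies_constraints`, the representation of constraints as index pairs, and the toy data are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Pair = Tuple[int, int]  # hypothetical encoding: indices of two data points


def mssc_objective(points: List[Point], labels: List[int], k: int) -> float:
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, c in zip(points, labels):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    # Centroid of each non-empty cluster; empty clusters have no points to score.
    centroids = [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(k)
    ]
    return sum(
        sum((p[d] - centroids[c][d]) ** 2 for d in range(dim))
        for p, c in zip(points, labels)
    )


def satisfies_constraints(
    labels: List[int], must_link: List[Pair], cannot_link: List[Pair]
) -> bool:
    """True if every must-link pair shares a cluster and every cannot-link pair does not."""
    return all(labels[i] == labels[j] for i, j in must_link) and all(
        labels[i] != labels[j] for i, j in cannot_link
    )


# Toy data (assumed for illustration): two well-separated groups in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
must_link = [(0, 1)]    # points 0 and 1 must share a cluster
cannot_link = [(1, 2)]  # points 1 and 2 must be separated

good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
print(mssc_objective(points, good, 2), satisfies_constraints(good, must_link, cannot_link))
print(mssc_objective(points, bad, 2), satisfies_constraints(bad, must_link, cannot_link))
```

In this toy case the assignment that respects both constraints also has the lower objective value, but in general the constraints can exclude the unconstrained optimum, which is why a feasible assignment has to be searched for explicitly.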