We study constrained clustering, where constraints guide the clustering process. In existing work, two categories of constraints have been widely explored: pairwise and cardinality constraints. Pairwise constraints force the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to follow a user-specified distribution. Most existing constrained clustering models can utilize only one category of constraints at a time. We incorporate both categories into a unified clustering model, starting from the integer program formulation of standard K-means. As the two categories provide different useful information, utilizing both allows for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the unified formulation. To address this, we utilize two techniques: first, equivalently replacing the binary constraints with the intersection of two continuous constraints; second, transforming the quadratic constraints into bilinear constraints by introducing extra variables. We thereby derive an equivalent continuous reformulation with simple constraints, which can be solved efficiently by the Alternating Direction Method of Multipliers (ADMM). Extensive experiments on both synthetic and real data demonstrate that (1) when utilizing a single category of constraints, the proposed model is superior to or competitive with state-of-the-art constrained clustering models, and (2) when utilizing both categories jointly, the proposed model performs better than with either single category. The experiments also show that the proposed method exploits the constraints to improve clustering performance by 2%-5% in classical clustering metrics, e.g., the Adjusted Rand, Mirkin's, and Hubert's indices, outperforming other methods.
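To make the constraint types concrete, the following is a minimal sketch rather than the formulation used in this work, which the abstract does not spell out. It assumes a one-hot assignment matrix `X` (one row per instance, one column per cluster), expresses must-link and cannot-link constraints as inner products of rows (the quadratic terms), and expresses cardinality constraints as column sums. For the binary constraints it assumes one well-known decomposition, the unit box intersected with a sphere centered at 1/2; the abstract says only "the intersection of two continuous constraints", so this specific choice is an assumption. All function names are hypothetical.

```python
def is_binary_via_intersection(x, tol=1e-9):
    """Test x in {0,1}^n using two continuous sets (assumed decomposition).

    Box:    0 <= x_i <= 1 for every i.
    Sphere: sum_i (x_i - 1/2)^2 == n / 4.
    Inside the box each term is at most 1/4, with equality only at 0 or 1,
    so both conditions hold together exactly when x is binary.
    """
    in_box = all(-tol <= v <= 1 + tol for v in x)
    on_sphere = abs(sum((v - 0.5) ** 2 for v in x) - len(x) / 4.0) <= tol
    return in_box and on_sphere


def one_hot(labels, k):
    """Build an n-by-k one-hot assignment matrix from integer labels."""
    return [[1.0 if c == lab else 0.0 for c in range(k)] for lab in labels]


def dot(u, v):
    """Inner product of two equal-length rows."""
    return sum(a * b for a, b in zip(u, v))


def constraint_report(X, must_link, cannot_link, sizes, tol=1e-9):
    """Report which constraint families a candidate assignment X satisfies."""
    col_sums = [sum(col) for col in zip(*X)]
    return {
        # every instance belongs to exactly one cluster
        "one_cluster_each": all(abs(sum(row) - 1.0) <= tol for row in X),
        # must-link (i, j): rows identical, so their inner product is 1
        "must_link": all(abs(dot(X[i], X[j]) - 1.0) <= tol for i, j in must_link),
        # cannot-link (i, j): rows orthogonal, so their inner product is 0
        "cannot_link": all(abs(dot(X[i], X[j])) <= tol for i, j in cannot_link),
        # cardinality: column sums match the requested cluster sizes
        "cardinality": all(abs(s - t) <= tol for s, t in zip(col_sums, sizes)),
    }


if __name__ == "__main__":
    X = one_hot([0, 0, 1, 1], k=2)
    flat = [v for row in X for v in row]
    print(is_binary_via_intersection(flat))
    print(constraint_report(X, must_link=[(0, 1)], cannot_link=[(0, 2)], sizes=[2, 2]))
```

The inner-product form shows where the quadratic constraints come from: each pairwise constraint multiplies two unknown rows of `X`. Introducing a copy of `X` as an extra variable would make each such product bilinear in two separate variables, which is the kind of splitting ADMM can alternate over.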