We consider a semi-supervised $k$-clustering problem in which information is available on whether pairs of objects belong to the same cluster or to different clusters. This information is available either with certainty or with a limited level of confidence. We introduce the PCCC algorithm, which iteratively assigns objects to clusters while accounting for the information provided on pairs of objects. Our algorithm can include relationships as hard constraints, which are guaranteed to be satisfied, or as soft constraints, which can be violated subject to a penalty. This flexibility distinguishes our algorithm from state-of-the-art approaches, which treat either all pairwise constraints as hard or all as soft. Unlike existing algorithms, our algorithm scales to large instances with up to 60,000 objects, 100 clusters, and millions of cannot-link constraints (the most challenging constraints to incorporate). We compare the PCCC algorithm with state-of-the-art approaches in an extensive computational study. Although the PCCC algorithm is more general in its applicability, it outperforms the state-of-the-art approaches on instances with all hard constraints or all soft constraints, in terms of both running time and various metrics of solution quality. The source code of the PCCC algorithm is publicly available on GitHub.
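To make the distinction between hard and soft pairwise constraints concrete, the following minimal sketch checks a given cluster assignment against a mixed set of must-link and cannot-link constraints. It is an illustration only and is not the PCCC algorithm: the names `Constraint`, `is_violated`, and `evaluate`, the tuple layout, and the use of a weight of `None` to mark a hard constraint are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding (assumed for this sketch): (i, j, kind, weight).
# kind is "ML" (must-link) or "CL" (cannot-link).
# weight None marks a hard constraint; a float marks a soft one.
Constraint = Tuple[int, int, str, Optional[float]]


def is_violated(labels: List[int], i: int, j: int, kind: str) -> bool:
    """A must-link is violated if the labels differ, a cannot-link if they match."""
    same = labels[i] == labels[j]
    return (not same) if kind == "ML" else same


def evaluate(labels: List[int], constraints: List[Constraint]) -> Tuple[bool, float]:
    """Return (all hard constraints satisfied, total penalty of violated soft ones)."""
    hard_feasible = True
    penalty = 0.0
    for i, j, kind, weight in constraints:
        if not is_violated(labels, i, j, kind):
            continue
        if weight is None:
            hard_feasible = False  # a hard constraint must never be violated
        else:
            penalty += weight  # a soft constraint may be violated at a cost
    return hard_feasible, penalty


labels = [0, 0, 1, 1]
constraints = [
    (0, 1, "ML", None),  # hard must-link, satisfied
    (1, 2, "CL", None),  # hard cannot-link, satisfied
    (2, 3, "CL", 0.5),   # soft cannot-link, violated
    (0, 3, "ML", 0.25),  # soft must-link, violated
]
print(evaluate(labels, constraints))  # (True, 0.75)
```

In a sketch like this, an assignment that violates any hard constraint would be rejected outright, while the soft penalty would be added to the clustering objective so that violations are traded off against cluster compactness.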