Clustering is a well-known unsupervised machine learning approach capable of automatically grouping discrete sets of instances with similar characteristics. Constrained clustering is a semi-supervised extension of this process that can be used when expert knowledge is available in the form of constraints. Well-known examples of such constraints are must-link (two instances belong to the same group) and cannot-link (two instances do not belong to the same group). The research area of constrained clustering has grown significantly over the years, with a large variety of new algorithms and more advanced types of constraints being proposed. However, no unifying overview is available to make sense of the wide variety of methods, constraints and benchmarks. To remedy this, this study presents the background of constrained clustering in detail and provides a novel ranked taxonomy of the types of constraints that can be used in constrained clustering. In addition, it focuses on instance-level pairwise constraints and gives an overview of their applications and historical context. It then presents a statistical analysis covering 307 constrained clustering methods, categorizes them according to their features, and provides a ranking score indicating which methods have the most potential based on their popularity and validation quality. Finally, based upon this analysis, potential pitfalls and future research directions are provided.
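To make the two pairwise constraint types concrete, the following minimal Python sketch checks a given cluster assignment against must-link and cannot-link pairs. It is an illustration only and is not taken from any of the surveyed methods; the function name `violated_constraints`, the `Pair` alias, and the toy data are assumptions introduced here.

```python
from typing import List, Sequence, Tuple

# Hypothetical alias: a constraint is a pair of instance indices.
Pair = Tuple[int, int]


def violated_constraints(
    labels: Sequence[int],
    must_link: Sequence[Pair],
    cannot_link: Sequence[Pair],
) -> Tuple[List[Pair], List[Pair]]:
    """Return the must-link and cannot-link pairs that `labels` violates."""
    # A must-link pair is violated when its two instances get different labels.
    broken_ml = [(i, j) for i, j in must_link if labels[i] != labels[j]]
    # A cannot-link pair is violated when its two instances share a label.
    broken_cl = [(i, j) for i, j in cannot_link if labels[i] == labels[j]]
    return broken_ml, broken_cl


# Toy assignment: instances 0 and 1 in cluster 0, instances 2 and 3 in cluster 1.
labels = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]
cannot_link = [(2, 3), (0, 3)]

broken_ml, broken_cl = violated_constraints(labels, must_link, cannot_link)
print("violated must-link:", broken_ml)
print("violated cannot-link:", broken_cl)
```

A check of this kind only evaluates a finished assignment. How constraints are used while clustering, for example as hard requirements or as penalties, differs between methods.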