Pattern discovery in multidimensional data sets has been a subject of research for decades, and a wide spectrum of clustering algorithms can be used for this purpose. However, their practical applications share a common post-clustering phase: expert-based interpretation and analysis of the obtained results. We argue that this phase can be the bottleneck of the process, especially when domain knowledge exists prior to clustering. Such a situation requires not only a proper analysis of the automatically discovered clusters but also conformance checking against the existing knowledge. In this work, we present Knowledge Augmented Clustering (KnAC). Its main goal is to confront expert-based labelling with automated clustering in order to update and refine the former. Our solution is not tied to any particular clustering algorithm. Instead, KnAC can serve as an augmentation of an arbitrary clustering algorithm, which makes the approach a robust, model-agnostic improvement of any state-of-the-art clustering method. We demonstrate the feasibility of our method on artificial, reproducible examples and in a real-life use case. In both cases, we achieved better results than classic clustering algorithms without augmentation.
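To make the idea of confronting expert-based labelling with automated clustering more concrete, the following is a minimal sketch and not the KnAC implementation. It assumes that each data point carries one expert label and one cluster id produced by any clustering algorithm. The function name `confront`, the `threshold` parameter, and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def confront(expert_labels, cluster_ids, threshold=0.3):
    """Cross-tabulate expert labels against cluster ids (illustrative only).

    For every expert label, list the clusters that hold at least
    `threshold` of its points. A label spread over more than one such
    cluster is flagged as a candidate for refinement by the expert.
    """
    if len(expert_labels) != len(cluster_ids):
        raise ValueError("label sequences must have equal length")

    # Contingency table: expert label -> cluster id -> number of points.
    table = defaultdict(Counter)
    for label, cluster in zip(expert_labels, cluster_ids):
        table[label][cluster] += 1

    report = {}
    for label, counts in table.items():
        total = sum(counts.values())
        major = sorted(c for c, n in counts.items() if n / total >= threshold)
        report[label] = {"clusters": major, "split_candidate": len(major) > 1}
    return report


# Hypothetical toy data: expert label "A" is spread over two clusters.
expert = ["A", "A", "A", "A", "B", "B", "B", "B"]
clusters = [0, 0, 1, 1, 2, 2, 2, 2]
report = confront(expert, clusters)
```

In this toy setting, a label flagged with `split_candidate` is one whose points fall into several clusters, which is the kind of disagreement an expert might review when updating the labelling. The threshold rule is an assumption of this sketch, not a criterion stated in the abstract.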