Supervised classification can be effective for prediction but is sometimes weak on interpretability or explainability (XAI). Clustering, on the other hand, tends to isolate categories or profiles that can be meaningful, but there is no guarantee that they are useful for label prediction. Predictive clustering seeks to obtain the best of both worlds. Starting from labeled data, it looks for clusters that are as pure as possible with respect to the class labels. One technique consists in tweaking a clustering algorithm so that data points sharing the same label tend to aggregate together. With distance-based algorithms such as k-means, one solution is to modify the distance used by the algorithm so that it incorporates information about the labels of the data points. In this paper, we propose another method, which relies on a change of representation guided by class densities and then carries out clustering in this new representation space. We present two new algorithms using this approach and show, on a variety of data sets, that they are competitive with purely supervised classifiers for prediction performance while offering interpretability of the clusters discovered.
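To make the idea concrete, here is a minimal sketch of a density-guided change of representation followed by ordinary clustering, under assumptions not stated in the abstract: each point is mapped to its estimated density under every class (here via Gaussian KDE), k-means is run in that new space, and each cluster is labeled by the majority class of its members so the clustering can also act as a classifier. The class name `DensityGuidedClustering`, the choice of estimator, and all hyper-parameters are illustrative, not the authors' exact algorithms.

```python
# Sketch only: class-density representation + k-means, with majority-vote
# cluster labels used for prediction. Not the paper's actual algorithms.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris


class DensityGuidedClustering:
    def __init__(self, n_clusters=5):
        self.n_clusters = n_clusters

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One density estimator per class; evaluating all of them on a point
        # gives its coordinates in the new representation space.
        self.kdes_ = [gaussian_kde(X[y == c].T) for c in self.classes_]
        Z = self._transform(X)
        self.km_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                          random_state=0).fit(Z)
        # Majority class of each cluster, reused at prediction time
        # (assumes integer class labels, as in the iris example below).
        self.cluster_class_ = np.array([
            self.classes_[np.bincount(y[self.km_.labels_ == k]).argmax()]
            for k in range(self.n_clusters)
        ])
        return self

    def _transform(self, X):
        # Feature j of a point = its estimated density under class j.
        return np.column_stack([kde(X.T) for kde in self.kdes_])

    def predict(self, X):
        clusters = self.km_.predict(self._transform(X))
        return self.cluster_class_[clusters]


X, y = load_iris(return_X_y=True)
model = DensityGuidedClustering(n_clusters=6).fit(X, y)
print("training accuracy:", (model.predict(X) == y).mean())
```

Each cluster in this sketch is directly inspectable (its members, its dominant class, its centroid in class-density coordinates), which is the kind of interpretability the abstract contrasts with purely supervised classifiers.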