With the rapid development of machine learning, improving its explainability has become a crucial research goal. We study the problem of making clusters more explainable by investigating cluster descriptors. We are given a set of objects $S$, a clustering $\pi$ of these objects, and a set of tags $T$ that did not participate in the clustering algorithm. Each object in $S$ is associated with a subset of $T$. The goal is to find a representative set of tags for each cluster, referred to as the cluster descriptors, subject to the constraint that the descriptors are pairwise disjoint, while minimizing the total size of all descriptors. In general, this problem is NP-hard. We propose a novel explainability model that strengthens previous models so that tags that neither contribute to explainability nor sufficiently distinguish between clusters are not added to the optimal descriptors. The proposed model is formulated as a quadratic unconstrained binary optimization (QUBO) problem, which makes it suitable for solving on modern optimization hardware accelerators. We experimentally demonstrate how the proposed explainability model can be solved on specialized hardware for accelerating combinatorial optimization, the Fujitsu Digital Annealer, and use real-life Twitter and PubMed datasets as use cases.
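To make the problem statement concrete, the following is a minimal brute-force sketch of the disjoint cluster descriptor problem on a toy instance. It is not the QUBO model proposed here and does not use the Digital Annealer. It assumes that "representative" means every object shares at least one tag with the descriptor of its own cluster, which is one common reading of the problem and is an assumption of this sketch. The function name `minimal_disjoint_descriptors` and the toy data are hypothetical.

```python
from itertools import product


def minimal_disjoint_descriptors(object_tags, clusters, tags):
    """Exhaustively search for the smallest pairwise-disjoint descriptors.

    object_tags: dict mapping each object to its set of tags
    clusters:    list of lists of objects (the clustering pi)
    tags:        list of all tags T
    Returns a list of sets (one descriptor per cluster), or None if no
    feasible assignment exists. Runtime is exponential in len(tags).
    """
    k = len(clusters)
    best, best_size = None, None
    # Each tag is given to exactly one cluster (0..k-1) or left unused (-1),
    # so the descriptors are pairwise disjoint by construction.
    for assignment in product(range(-1, k), repeat=len(tags)):
        descriptors = [set() for _ in range(k)]
        for tag, owner in zip(tags, assignment):
            if owner >= 0:
                descriptors[owner].add(tag)
        # Assumed coverage rule: every object shares at least one tag
        # with the descriptor of the cluster it belongs to.
        covered = all(
            object_tags[obj] & descriptors[i]
            for i, members in enumerate(clusters)
            for obj in members
        )
        size = sum(len(d) for d in descriptors)
        if covered and (best_size is None or size < best_size):
            best, best_size = descriptors, size
    return best


# Toy instance (hypothetical data): four objects, two clusters, three tags.
object_tags = {
    "a": {"t1", "t2"},
    "b": {"t2"},
    "c": {"t3"},
    "d": {"t1", "t3"},
}
clusters = [["a", "b"], ["c", "d"]]
tags = ["t1", "t2", "t3"]

result = minimal_disjoint_descriptors(object_tags, clusters, tags)
print(result)
```

On this instance, object `b` forces `t2` into the first descriptor and object `c` forces `t3` into the second, so the shared tag `t1` is never needed. Exhaustive search is only viable for a handful of tags, which is consistent with the NP-hardness noted above and motivates heuristic or hardware-accelerated solvers.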