Recent work on interpretability has focused on concept-based explanations, where deep learning models are explained in terms of high-level units of information, referred to as concepts. Concept learning models, however, have been shown to be prone to encoding impurities in their representations, failing to fully capture meaningful features of their inputs. While concept learning lacks metrics to measure such phenomena, the field of disentanglement learning has explored the related notion of underlying factors of variation in the data, with a wealth of metrics to measure the purity of such factors. In this paper, we show that such metrics are not appropriate for concept learning and propose novel metrics for evaluating the purity of concept representations in both approaches. We show the advantage of these metrics over existing ones and demonstrate their utility in evaluating the robustness of concept representations and of interventions performed on them. In addition, we show their utility for benchmarking state-of-the-art methods from both families and find that, contrary to common assumptions, supervision alone may not be sufficient for obtaining pure concept representations.