Recent developments in machine learning have introduced models that approach human performance at the cost of increased architectural complexity. Efforts to make the rationales behind the models' predictions transparent have inspired an abundance of new explainability techniques. Provided with an already trained model, these techniques compute saliency scores for the words of an input instance. However, there exists no definitive guide on (i) how to choose such a technique given a particular application task and model architecture, and (ii) the benefits and drawbacks of using each such technique. In this paper, we develop a comprehensive list of diagnostic properties for evaluating existing explainability techniques. We then employ the proposed list to compare a set of diverse explainability techniques on downstream text classification tasks and neural network architectures. We also compare the saliency scores assigned by the explainability techniques with human annotations of salient input regions to find relations between a model's performance and the agreement of its rationales with human ones. Overall, we find that gradient-based explanations perform best across tasks and model architectures, and we present further insights into the properties of the reviewed explainability techniques.
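To make the notion of word-level saliency scores concrete, the following is a minimal sketch of one gradient-based attribution (gradient times input) over a trained text classifier. The `model.embed` and `model.classify` components are hypothetical stand-ins for an arbitrary architecture, not the specific models or the full set of gradient-based variants evaluated in the paper.

```python
# Minimal sketch, assuming a hypothetical PyTorch model with an `embed`
# module (token ids -> embeddings) and a `classify` module
# (embeddings -> class logits). The paper studies several gradient-based
# techniques; gradient x input shown here is only one of them.
import torch


def saliency_scores(model: torch.nn.Module,
                    input_ids: torch.Tensor,
                    target_class: int) -> torch.Tensor:
    """Return one saliency score per input token for `target_class`."""
    model.eval()
    # Detach the embeddings so they become a leaf tensor whose gradient
    # is populated by the backward pass.
    embeddings = model.embed(input_ids).detach().requires_grad_(True)
    logits = model.classify(embeddings)        # shape: (num_classes,)
    logits[target_class].backward()
    # Aggregate per token: L2 norm of (gradient * embedding).
    return (embeddings.grad * embeddings).norm(dim=-1)
```

The per-token scores produced this way can then be compared against human rationale annotations, as done in the paper's agreement analysis.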