Attention mechanisms have played a crucial role in the development of complex architectures such as Transformers in natural language processing. However, Transformers remain hard to interpret and are often regarded as black boxes. This paper assesses how attention coefficients from Transformers can help provide interpretability. A new attention-based interpretability method called CLaSsification-Attention (CLS-A) is proposed. CLS-A computes an interpretability score for each word from the distribution of attention coefficients associated with the part of the Transformer architecture specific to the classification task. A human-grounded experiment is conducted to evaluate CLS-A and compare it to other interpretability methods. The experimental protocol relies on the capacity of an interpretability method to provide explanations in line with human reasoning, and its design includes measuring the reaction times and correct-response rates of human subjects. CLS-A performs comparably to standard interpretability methods in terms of average participant reaction time and accuracy. Its lower computational cost compared to other interpretability methods and its availability by design within the classifier make it particularly attractive. Data analysis also highlights the link between the probability score of a classifier prediction and adequate explanations. Finally, our work confirms the relevance of CLS-A and shows the extent to which self-attention contains rich information for explaining Transformer classifiers.
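The sketch below illustrates the kind of score the abstract describes, assuming CLS-A is derived from the attention paid by the classification token ([CLS]) to each input token; the model name, the choice of layers, and the averaging over heads and layers are illustrative assumptions, not the paper's exact procedure.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative sketch: per-token scores from the attention the [CLS] token
# (the part of the encoder used for classification) assigns to each word.
# "bert-base-uncased" is a placeholder; in practice a classifier fine-tuned
# on the target task would be used.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, output_attentions=True
)

text = "The movie was surprisingly good."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
# Keep the attention rows originating from the [CLS] token (query position 0),
# then average over layers and heads to get one score per token (assumed aggregation).
cls_attention = torch.stack(outputs.attentions)[:, 0, :, 0, :]  # (layers, heads, seq_len)
scores = cls_attention.mean(dim=(0, 1))                         # (seq_len,)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, scores.tolist()):
    print(f"{token:>12s}  {score:.3f}")
```

Because these scores come directly from the forward pass of the classifier, no additional backward passes or input perturbations are required, which is the source of the low computational cost mentioned above.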