Transformers have become an important workhorse of machine learning, with numerous applications. This necessitates the development of reliable methods for increasing their transparency. Multiple interpretability methods, often based on gradient information, have been proposed. We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction. We identify Attention Heads and LayerNorm as the main reasons for such unreliable explanations and propose a more stable way to propagate through these layers. Our proposal, which can be seen as a proper extension of the well-established LRP method to Transformers, is shown both theoretically and empirically to overcome the deficiency of a simple gradient-based approach, and achieves state-of-the-art explanation performance on a broad range of Transformer models and datasets.
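To make the proposed propagation concrete, below is a minimal PyTorch sketch of one common way to implement such stabilized rules: the attention weights and the LayerNorm normalization factor are detached from the computational graph, so that Gradient×Input on the modified model behaves like an LRP-style relevance propagation rather than a purely local gradient. The names (StableLayerNorm, stable_attention) are illustrative, and the detach-based formulation is an assumption about the method, not a verbatim implementation.

```python
import torch
import torch.nn as nn

class StableLayerNorm(nn.Module):
    """LayerNorm variant for stable relevance propagation: the normalization
    factor is detached, so it acts as a constant in the backward pass
    (assumed formulation; mean-centering stays differentiable since it is linear)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, unbiased=False, keepdim=True)
        std = torch.sqrt(var + self.eps).detach()  # treated as a constant
        return self.weight * (x - mean) / std + self.bias

def stable_attention(q, k, v):
    """Attention where the softmax weights are detached, so relevance flows
    only through the value path, treating the head as locally linear."""
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)
    return attn.detach() @ v  # attention matrix held constant in backward

# Illustrative usage: with these layers in place, Gradient x Input on the
# modified graph yields LRP-style input relevances.
# x.requires_grad_(True); out = model(x); out[..., target].sum().backward()
# relevance = (x * x.grad).sum(-1)
```

The design intuition is that both layers are treated as locally linear maps: holding the attention weights and the normalization factor fixed removes the sources of gradient instability identified above, while leaving the linear signal path intact for relevance to flow through.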