With the growing popularity of deep-learning models, model understanding has become increasingly important, and much effort has been devoted to demystifying deep neural networks for better interpretability. Feature attribution methods have shown promising results in computer vision, especially gradient-based methods, where effectively smoothing the gradients with reference data is key to robust and faithful results. However, directly applying these gradient-based methods to NLP tasks is not trivial, because the input consists of discrete tokens and the "reference" tokens are not explicitly defined. In this work, we propose Locally Aggregated Feature Attribution (LAFA), a novel gradient-based feature attribution method for NLP models. Instead of relying on obscure reference tokens, it smooths gradients by aggregating similar reference texts derived from language-model embeddings. For evaluation purposes, we also design experiments on different NLP tasks, including entity recognition and sentiment analysis on public datasets as well as key feature detection on a constructed Amazon catalogue dataset. The superior performance of the proposed method is demonstrated through these experiments.
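To make the core idea concrete, the following is a minimal sketch (not the paper's actual LAFA algorithm) of gradient smoothing with multiple references: path-integrated gradients are computed from each reference point in embedding space to the input, and the per-reference attributions are averaged. All function names and the toy linear model here are illustrative assumptions.

```python
import numpy as np

def path_integrated_grads(x, reference, grad_fn, steps=50):
    """Average the gradient along a straight path from a reference
    embedding to the input embedding (integrated-gradients style),
    then scale by the input-minus-reference difference."""
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.mean(
        [grad_fn(reference + a * (x - reference)) for a in alphas], axis=0
    )
    return (x - reference) * grads

def aggregated_attribution(x, references, grad_fn):
    """Smooth the attribution by averaging path-integrated gradients
    over several nearby reference embeddings, instead of relying on
    a single (possibly ill-defined) reference token."""
    return np.mean(
        [path_integrated_grads(x, r, grad_fn) for r in references], axis=0
    )

# Toy example: a linear model f(z) = w @ z, whose gradient is w everywhere.
w = np.array([1.0, 2.0, 3.0])
grad_fn = lambda z: w
x = np.ones(3)
references = [np.zeros(3), np.zeros(3)]  # two (identical) reference points
attr = aggregated_attribution(x, references, grad_fn)
# For a linear model with zero references, the attribution recovers w exactly.
```

In the NLP setting described above, the references would not be hand-picked baseline tokens but embeddings of similar texts retrieved from a language model's embedding space, which is what makes the smoothing locally meaningful.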