Widespread adoption of deep models has motivated a pressing need for approaches to interpret network outputs and to facilitate model debugging. Instance attribution methods constitute one means of accomplishing these goals by retrieving training instances that (may have) led to a particular prediction. Influence functions (IF; Koh and Liang 2017) provide machinery for doing this by quantifying the effect that perturbing individual train instances would have on a specific test prediction. However, even approximating the IF is computationally expensive, to a degree that may be prohibitive in many cases. Might simpler approaches (e.g., retrieving train examples most similar to a given test point) perform comparably? In this work, we evaluate the degree to which different potential instance attribution methods agree with respect to the importance of training samples. We find that simple retrieval methods yield training instances that differ from those identified via gradient-based methods (such as IFs), but that nonetheless exhibit desirable characteristics similar to more complex attribution methods. Code for all methods and experiments in this paper is available at: https://github.com/successar/instance_attributions_NLP.
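To make the "simpler approach" concrete, below is a minimal sketch of a similarity-based retrieval baseline: rank training instances by cosine similarity of their representations to a test point's representation. This is an illustrative assumption, not the authors' exact implementation; the representations here are random placeholders standing in for whatever encoder the model provides.

```python
import numpy as np

def retrieve_similar(train_reps, test_rep, k=5):
    """Return indices of the k training instances whose representations
    are most cosine-similar to the test representation.

    train_reps: (n_train, d) array of training-instance representations
    test_rep:   (d,) representation of the test point
    """
    train_norm = train_reps / np.linalg.norm(train_reps, axis=1, keepdims=True)
    test_norm = test_rep / np.linalg.norm(test_rep)
    scores = train_norm @ test_norm          # cosine similarities
    return np.argsort(-scores)[:k]           # indices of top-k most similar

# Toy usage with random stand-in representations (hypothetical data).
rng = np.random.default_rng(0)
train_reps = rng.normal(size=(100, 768))
test_rep = rng.normal(size=768)
print(retrieve_similar(train_reps, test_rep, k=3))
```

Unlike influence functions, this baseline needs no Hessian-vector products or gradient computations per training instance, which is the source of the computational gap the abstract alludes to.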