Training data attribution (TDA) is concerned with understanding model behavior in terms of the training data. This paper draws attention to the common setting where one has access only to the final trained model, and not the training algorithm or intermediate information from training. We reframe the problem in this "final-model-only" setting as one of measuring sensitivity of the model to training instances. To operationalize this reframing, we propose further training, with appropriate adjustment and averaging, as a gold standard method to measure sensitivity. We then unify existing gradient-based methods for TDA by showing that they all approximate the further training gold standard in different ways. We investigate empirically the quality of these gradient-based approximations to further training, for tabular, image, and text datasets and models. We find that the approximation quality of first-order methods is sometimes high but decays with the amount of further training. In contrast, the approximations given by influence function methods are more stable but surprisingly lower in quality.