Language models (LMs) have been shown to memorize a great deal of factual knowledge contained in their training data. But when an LM generates an assertion, it is often difficult to determine where it learned this information and whether it is true. In this paper, we propose the problem of fact tracing: identifying which training examples taught an LM to generate a particular factual assertion. Prior work on training data attribution (TDA) may offer effective tools for identifying such examples, known as "proponents". We present the first quantitative benchmark to evaluate this. We compare two popular families of TDA methods -- gradient-based and embedding-based -- and find that much headroom remains. For example, both methods have lower proponent-retrieval precision than an information retrieval baseline (BM25) that does not have access to the LM at all. We identify key challenges that must be addressed for further improvement, such as overcoming gradient saturation, and we also show how several nuanced implementation details of existing neural TDA methods can significantly improve overall fact tracing performance.
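To make the gradient-based family concrete, here is a minimal sketch of TracIn-style proponent scoring: each training example is ranked by the dot product between its loss gradient and the query's loss gradient. This is not the paper's implementation; the linear model, random data, and loss below are illustrative toy assumptions standing in for an LM and its training corpus.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for the LM: a small linear classifier.
model = torch.nn.Linear(8, 2)
loss_fn = torch.nn.CrossEntropyLoss()

def loss_grad(x, y):
    """Flattened gradient of one example's loss w.r.t. all model parameters."""
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Hypothetical training corpus and one factual query (random toy data).
train_x = torch.randn(100, 8)
train_y = torch.randint(0, 2, (100,))
query_x, query_y = torch.randn(8), torch.tensor(1)

# Score each training example by the dot product of its loss gradient
# with the query's loss gradient (higher = stronger predicted proponent).
q_grad = loss_grad(query_x, query_y)
scores = torch.stack([loss_grad(x, y) @ q_grad
                      for x, y in zip(train_x, train_y)])

# The top-scoring training examples are the retrieved "proponents".
print(scores.topk(5).indices.tolist())
```

Note that this scoring rule is where gradient saturation bites: once a fact is confidently memorized, its loss gradients shrink toward zero, flattening the scores; at LM scale, per-example gradients are also typically restricted to a subset of layers or approximated rather than computed in full as above.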