Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting. We propose and study Attributed QA as a key first step in the development of attributed LLMs. We develop a reproducible evaluation framework for the task, using human annotations as a gold standard and a correlated automatic metric that we show is suitable for development settings. We describe and benchmark a broad set of architectures for the task. Our contributions give some concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and give some hints as to how to address a third key question (How to build LLMs with attribution?).