Despite recent progress, it has been difficult to prevent semantic hallucinations in generative Large Language Models. One common solution is to augment LLMs with a retrieval system and to ensure that the generated output is attributable to the retrieved information. Given this added constraint, it is plausible to expect that the overall quality of the output will be affected, for example in terms of fluency. Can scaling language models help? Here we examine the relationship between fluency and attribution in LLMs prompted with retrieved evidence in knowledge-heavy dialog settings. Our experiments use a set of auto-metrics that are aligned with human preferences to evaluate a large set of generations, produced under varying LLM parameters and supplied context. We show that larger models tend to do much better on both fluency and attribution, and that (naively) using top-k retrieval instead of top-1 retrieval improves attribution but hurts fluency. We next propose a recipe that could allow smaller models to both close the gap with larger models and preserve the benefits of top-k retrieval while avoiding its drawbacks.
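To make the retrieval-augmented setup concrete, here is a minimal sketch of how top-1 versus top-k evidence might be spliced into a prompt so that the generated response can be attributed to it. The corpus, the lexical scorer, and the prompt template are illustrative assumptions, not the pipeline evaluated in the paper.

```python
# Minimal sketch of retrieval-augmented prompting with top-1 vs. top-k
# evidence. Everything here (corpus, scoring, template) is a hypothetical
# stand-in for the paper's actual retriever and prompt format.

from collections import Counter

CORPUS = [
    "The Eiffel Tower is 330 metres tall and located in Paris.",
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "Gustave Eiffel's company designed and built the tower.",
]

def lexical_score(query: str, passage: str) -> int:
    """Toy relevance score: number of shared lowercase tokens."""
    q, p = Counter(query.lower().split()), Counter(passage.lower().split())
    return sum((q & p).values())

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k passages with the highest lexical overlap."""
    return sorted(CORPUS, key=lambda p: lexical_score(query, p), reverse=True)[:k]

def build_prompt(query: str, k: int) -> str:
    """Prepend numbered evidence so the response can cite (be attributed to) it."""
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieve(query, k)))
    return (
        "Answer using only the evidence below, citing passage numbers.\n"
        f"Evidence:\n{evidence}\n"
        f"User: {query}\nAssistant:"
    )

# top-1 yields a short context that is easy to integrate fluently; top-k
# supplies more support for attribution at the cost of a longer, noisier
# prompt -- the trade-off the abstract describes.
print(build_prompt("How tall is the Eiffel Tower?", k=1))
print(build_prompt("How tall is the Eiffel Tower?", k=3))
```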