Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/google/BEGIN-dataset.