The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded in knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. An investigation of hallucinations by Dziri et al. (2022) revealed that existing knowledge-grounded benchmarks are contaminated with hallucinated responses at an alarming level (>60% of the responses) and that models trained on this data amplify hallucinations even further (>80% of the responses). To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark. We observe that FaithDial is more faithful than WoW while also maintaining engaging conversations. We show that FaithDial can serve as a training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not and improves performance on the BEGIN benchmark by 21.1 F1 points compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness according to several automated metrics. Further, we find that the benefits of FaithDial generalize to zero-shot transfer on other datasets, such as CMU-Dog and TopicalChat. Finally, human evaluation reveals that responses generated by models trained on FaithDial are perceived as more interpretable, cooperative, and engaging.