Dialogue summarization is abstractive in nature, which makes it prone to factual errors. The factual correctness of summaries is of the highest priority before practical application. Many efforts have been made to improve faithfulness in text summarization, but there is a lack of systematic study on dialogue summarization systems. In this work, we first perform a fine-grained human analysis of the faithfulness of dialogue summaries and observe that over 35% of generated summaries are factually inconsistent with respect to the source dialogues. Furthermore, we present a new model-level faithfulness evaluation method. It examines generation models with multiple-choice questions created by rule-based transformations. Experimental results show that our evaluation schema is a strong proxy for the factual correctness of summarization models. The human-annotated faithfulness samples and the evaluation toolkit are released to facilitate future research toward faithful dialogue summarization.