Abstractive summarization models often generate content that is unfaithful to the input, which highlights the importance of evaluating the faithfulness of generated summaries. Most faithfulness metrics are evaluated only on the news domain; can they be transferred to other summarization tasks? In this work, we present the first systematic study of faithfulness metrics for dialogue summarization. We evaluate common faithfulness metrics on dialogue datasets and observe that most metrics correlate poorly with human judgements despite performing well on news datasets. Given these findings, to improve the performance of existing metrics on dialogue summarization, we first fine-tune them on in-domain data and then apply unlikelihood training on negative samples, and we show that both steps successfully improve metric performance on dialogue data. Inspired by the strong zero-shot performance of the T0 language model, we further propose T0-Score, a new metric for faithfulness evaluation that shows consistent improvement over baseline metrics across multiple domains.
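To make the unlikelihood-training step concrete, below is a minimal sketch of the standard unlikelihood objective applied to negative samples, assuming each negative sample is an unfaithful summary whose tokens should be penalized. The function and variable names are illustrative, not the paper's implementation.

```python
import torch

def unlikelihood_loss(logits: torch.Tensor, neg_targets: torch.Tensor) -> torch.Tensor:
    """Penalize the probability the model assigns to tokens of a negative
    (unfaithful) summary: L = -mean log(1 - p(neg_token)).

    logits:      (batch, seq_len, vocab_size) decoder outputs
    neg_targets: (batch, seq_len) token ids of the negative sample
    """
    probs = torch.softmax(logits, dim=-1)
    # Probability assigned to each token of the negative sample.
    p_neg = probs.gather(-1, neg_targets.unsqueeze(-1)).squeeze(-1)
    # Clamp for numerical stability before taking the log.
    return -torch.log((1.0 - p_neg).clamp(min=1e-8)).mean()
```

In training, this term would be combined with the usual likelihood loss on faithful (positive) summaries, so the model learns to score faithful text up and hallucinated text down.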
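As a hedged illustration of how a T0-based faithfulness score could be computed, the sketch below scores a summary by its average token log-likelihood under T0 conditioned on the dialogue. The prompt template, the use of the `bigscience/T0_3B` checkpoint, and the exact scoring formula are assumptions for illustration; the paper's formulation may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumption: the smaller T0_3B checkpoint stands in for T0 here.
tokenizer = AutoTokenizer.from_pretrained("bigscience/T0_3B")
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0_3B")
model.eval()

def t0_score(dialogue: str, summary: str) -> float:
    # Hypothetical prompt template; the actual template is an assumption.
    prompt = f"Summarize the following dialogue:\n{dialogue}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    labels = tokenizer(summary, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss  # mean token NLL
    # Higher (less negative) average log-likelihood = judged more faithful.
    return -loss.item()
```

Under this reading, candidate summaries of the same dialogue can be ranked by their scores, with higher values indicating greater faithfulness to the source.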