One of the challenges in developing a summarization model arises from the difficulty of measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested in (Miao et al., 2021) as a measure of hallucination risk, in order to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which requires only an off-the-shelf summarization model to compute the hallucination risk from token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation with human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which highlights the merit of reusing summarization models for evaluation, facilitates progress in both the automated evaluation and the generation of summaries.
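To make the idea concrete, the sketch below shows how a reference-free score in this spirit could be computed with an off-the-shelf summarization model: the summary's token likelihoods conditioned on the source are combined with a simple overconfidence-based risk term derived from the same probabilities. The model name (facebook/bart-large-cnn), the source-ablated pass used as a language-model proxy, the weight lam, and the exact aggregation are all illustrative assumptions, not the authors' exact HaRiM+ definition.

```python
# Minimal sketch of a likelihood-based, reference-free summary score.
# Assumptions: facebook/bart-large-cnn as the off-the-shelf summarizer, an
# empty-source pass as a rough LM proxy, and an illustrative risk formula.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-large-cnn"  # assumed off-the-shelf summarizer
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name).eval()

def summary_score(source: str, summary: str, lam: float = 7.0) -> float:
    """Reference-free quality estimate: mean token log-likelihood minus a
    hallucination-risk penalty (lam and the penalty form are illustrative)."""
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    dec = tokenizer(summary, return_tensors="pt", truncation=True)
    labels = dec["input_ids"]
    with torch.no_grad():
        # Source-conditioned pass: p(y_t | y_<t, x)
        logits_s2s = model(**enc, labels=labels).logits
        # Source-ablated pass (empty source): rough proxy for p(y_t | y_<t)
        empty = tokenizer("", return_tensors="pt")
        logits_lm = model(**empty, labels=labels).logits
    probs_s2s = torch.softmax(logits_s2s, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    probs_lm = torch.softmax(logits_lm, dim=-1).gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    log_lik = torch.log(probs_s2s).mean().item()
    # Risk is treated as high when the model is confident about a token even
    # without support from the source (small margin between the two passes).
    margin = probs_s2s - probs_lm
    risk = ((1.0 - probs_s2s) * (1.0 - margin)).mean().item()
    return log_lik - lam * risk
```

Under these assumptions, a higher score indicates a summary that the summarizer finds both likely and well supported by the source; the two forward passes require no training or human-aligned modules, only the pretrained model itself.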