Question answering (QA)-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct, a task known as answer verification. In this work, we benchmark the lexical answer verification methods used by current QA-based metrics, as well as two more sophisticated text comparison methods, BERTScore and LERC. We find that LERC outperforms the other methods in some settings while remaining statistically indistinguishable from lexical overlap in others. However, our experiments reveal that improved verification performance does not necessarily translate to overall QA-based metric quality: In some scenarios, using a worse verification method -- or using none at all -- performs comparably to using the best verification method, a result that we attribute to properties of the datasets.
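To make the contrast concrete, below is a minimal sketch of the kind of lexical answer verification the abstract refers to: a token-level F1 comparison between the QA model's prediction and the expected answer, in the style of SQuAD-style scoring. The function names, normalization steps, and example strings are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of lexical answer verification (token-level F1), the kind of
# overlap-based check contrasted with learned comparison methods such as
# BERTScore and LERC. Names and normalization details are illustrative
# assumptions, not the paper's implementation.
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, strip punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a QA model's prediction and the expected answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A QA-based summarization metric would aggregate such per-answer scores over
# all generated questions; "verification" decides whether each answer counts.
print(token_f1("the Eiffel Tower", "Eiffel Tower in Paris"))  # partial overlap
```

A learned verification method such as LERC would replace `token_f1` with a model-based score, which is exactly the substitution whose downstream effect on metric quality the paper measures.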