Existing metrics for evaluating the quality of automatically generated questions, such as BLEU, ROUGE, BERTScore, and BLEURT, compare the candidate question with the reference question and assign a high score when there is considerable lexical overlap or semantic similarity between them. This approach has two major shortcomings. First, it requires expensive human-provided reference questions. Second, it penalises valid questions that have little lexical or semantic similarity to the reference question. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering module and a span scorer module, both of which use pre-trained models from the existing literature; our metric can therefore be applied without further training. We show that RQUGE correlates more strongly with human judgment than existing metrics, without relying on a reference question. RQUGE is also significantly more robust to several adversarial corruptions. Additionally, we illustrate that the performance of QA models on out-of-domain datasets can be significantly improved by fine-tuning on synthetic data generated by a question generation model and re-ranked by RQUGE.
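A minimal sketch of the answerability-based scoring idea described above, assuming an off-the-shelf extractive QA checkpoint from the `transformers` library and a simple token-overlap F1 as an illustrative stand-in for the learned span scorer; the model name and helper functions are assumptions for illustration, not the authors' exact implementation.

```python
# Hypothetical sketch of an answerability-based score in the spirit of RQUGE.
# The QA checkpoint and the token-F1 scorer below are illustrative stand-ins,
# not the actual RQUGE modules.
from collections import Counter
from transformers import pipeline

# Off-the-shelf extractive QA model (assumption: any SQuAD-style checkpoint works here).
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, used here as a stand-in for the learned span scorer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def answerability_score(context: str, candidate_question: str, gold_answer: str) -> float:
    """Answer the candidate question over the context, then score the predicted
    span against the gold answer span. No reference question is needed."""
    predicted = qa_model(question=candidate_question, context=context)
    return token_f1(predicted["answer"], gold_answer)

# Usage example with a toy context.
context = "RQUGE is a reference-free metric for evaluating generated questions."
print(answerability_score(context, "What is RQUGE?", "a reference-free metric"))
```

The same score can also be used to re-rank synthetic question-answer pairs before fine-tuning a QA model, as described in the abstract.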