There are several issues with existing general machine translation and natural language generation evaluation metrics, and question-answering (QA) systems are no exception in this context. To build robust QA systems, we need equally robust evaluation metrics that verify whether model predictions to questions are similar to ground-truth annotations. The ability to compare similarity based on semantics rather than pure string overlap is important for comparing models fairly and for setting more realistic acceptance criteria in real-life applications. We build upon the first paper, to our knowledge, that uses transformer-based metrics to assess semantic answer similarity, and we achieve higher correlation with human judgement in cases of no lexical overlap. We propose cross-encoder-augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To the best of our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.
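To make the modelling setup concrete, below is a minimal sketch of scoring a predicted answer against a ground-truth answer with a bi-encoder and a cross-encoder, assuming the sentence-transformers library; the checkpoint names are illustrative off-the-shelf models, not the fine-tuned models described in this paper.

```python
# Minimal sketch: semantic answer similarity via bi-encoder and cross-encoder.
# Assumes the sentence-transformers library; model names are illustrative
# public checkpoints, not the models trained on the paper's name-pair dataset.
from sentence_transformers import CrossEncoder, SentenceTransformer, util

prediction = "Barack Obama"
ground_truth = "President Obama"

# Bi-encoder: embed each answer independently, then compare the embeddings.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb_pred, emb_gold = bi_encoder.encode([prediction, ground_truth])
bi_score = util.cos_sim(emb_pred, emb_gold).item()

# Cross-encoder: score the answer pair jointly in a single forward pass,
# which is typically more accurate but does not scale to pairwise search.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
cross_score = cross_encoder.predict([(prediction, ground_truth)])[0]

print(f"bi-encoder cosine similarity: {bi_score:.3f}")
print(f"cross-encoder similarity:     {cross_score:.3f}")
```

Both scores reward semantic equivalence (here, two co-referent name strings) even when lexical overlap between prediction and ground truth is low, which is exactly the failure mode of string-overlap metrics motivating this work.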