The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison relies mostly on lexical overlap and therefore misses answers that share no tokens with the ground truth but are still semantically similar, treating correct answers as incorrect. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In this short paper, we present SAS, a cross-encoder-based metric for the estimation of semantic answer similarity, and compare it to seven existing metrics. To this end, we create an English and a German three-way annotated evaluation dataset containing pairs of answers along with human judgments of their semantic similarity, which we release along with an implementation of the SAS metric and the experiments. We find that semantic similarity metrics based on recent transformer models correlate much better with human judgment than traditional lexical similarity metrics on our two newly created datasets and one dataset from related work.
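To illustrate the idea of a cross-encoder-based semantic answer similarity score, the following minimal sketch uses the sentence-transformers library with a publicly available STS cross-encoder model and hypothetical answer pairs and human labels; it is an assumption for illustration, not the exact released implementation or model from this paper.

```python
# Illustrative sketch: scoring answer pairs with a cross-encoder and comparing
# the scores to human similarity judgments. The model name and the data below
# are assumptions chosen for demonstration purposes only.
from sentence_transformers import CrossEncoder
from scipy.stats import spearmanr

# Ground-truth answers paired with model predictions.
answer_pairs = [
    ("Albert Einstein", "Einstein"),
    ("the Eiffel Tower", "a tower in Paris"),
    ("42", "forty-two"),
]

# Hypothetical human judgments of semantic similarity for the same pairs
# (e.g., 0 = dissimilar, 1 = related, 2 = equivalent).
human_labels = [2, 1, 2]

# A cross-encoder encodes both answers jointly and outputs a similarity score,
# so answers without any lexical overlap can still receive a high score.
model = CrossEncoder("cross-encoder/stsb-roberta-large")
sas_scores = model.predict(answer_pairs)

# Metrics are typically compared by how well they correlate with human judgment.
correlation, _ = spearmanr(sas_scores, human_labels)
print(f"SAS scores: {sas_scores}")
print(f"Spearman correlation with human labels: {correlation:.3f}")
```

In contrast to a bi-encoder, which embeds each answer independently, the cross-encoder attends over both answers at once, which is why such models tend to produce more accurate pairwise similarity estimates at the cost of not yielding reusable embeddings.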