The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems; it is typically addressed by extending exact match (EM) with pre-defined rules or by using the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures. To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and publish over 23k human judgments for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as a false impression of graduality, or a missing dependence on the question. Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching (BEM) measure to approximate this task. Because answer equivalence is a simpler task than QA, we find that BEM provides significantly better AE approximations than F1 and more accurately reflects the performance of systems. Finally, we demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to a factor of 2.6.
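For reference, the token-level F1 measure discussed above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes simple lowercased whitespace tokenization, whereas the official SQuAD evaluation script additionally normalizes punctuation and articles.

```python
# Minimal sketch of token-level F1 between a candidate and a reference answer.
# Answers are compared as bags of tokens, ignoring the question entirely.
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(cand_tokens) & Counter(ref_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(cand_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Example: a partially overlapping answer receives partial credit regardless of
# whether the extra or missing tokens actually change the meaning.
print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```

The example illustrates the "false impression of graduality" mentioned above: overlap scores between 0 and 1 need not correspond to degrees of correctness.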