There are several issues with the existing general machine translation or natural language generation evaluation metrics, and question-answering (QA) systems are no different in that respect. To build robust QA systems, we need equally robust evaluation systems to verify whether model predictions for questions are similar to ground-truth annotations. The ability to compare similarity based on semantics, as opposed to pure string overlap, is important for comparing models fairly and for indicating more realistic acceptance criteria in real-life applications. We build upon what is, to our knowledge, the first paper to use transformer-based model metrics to assess semantic answer similarity, and we achieve higher correlations with human judgement in the case of no lexical overlap. We propose cross-encoder augmented bi-encoder and BERTScore models for semantic answer similarity, trained on a new dataset consisting of name pairs of US-American public figures. To the best of our knowledge, we provide the first dataset of co-referent name string pairs along with their similarities, which can be used for training.

Machine Learning & Applications, 4th International Conference on Machine Learning & Applications (CMLA 2022), June 25-26, 2022, Copenhagen, Denmark. Volume Editors: David C. Wyld, Dhinaharan Nagamalai (Eds). ISBN: 978-1-925953-69-5