Due to the labor-intensive nature of manual test oracle construction, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume the generated test case preserves an equivalent or similar semantic meaning and thus the same label. In practice, however, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., they contain grammar errors), which leads to a high false alarm rate. Our evaluation study finds that 44% of the test cases generated by state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking effort, and instead of improving NLP software, they can even degrade it when used in model training. To address this problem, we propose AEON for Automatic Evaluation Of NLP test cases. For each generated test case, AEON outputs scores based on semantic similarity and language naturalness. We employ AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks. The results show that AEON aligns best with human judgment. In particular, AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%. AEON also attains the highest average precision in finding unnatural test cases, surpassing the baselines by more than 15%. Moreover, model training with test cases prioritized by AEON yields models that are more accurate and robust, demonstrating AEON's potential for improving NLP software.