Machine learning models can reach high performance on benchmark natural language processing (NLP) datasets yet fail in more challenging settings. We study this issue when a pre-trained model learns dataset artifacts in natural language inference (NLI), the task of determining the logical relationship between a pair of text sequences. We provide a variety of techniques for analyzing and locating dataset artifacts in the crowdsourced Stanford Natural Language Inference (SNLI) corpus, and we study the stylistic patterns of those artifacts. To mitigate dataset artifacts, we employ a novel multi-scale data augmentation technique with two distinct frameworks: a behavioral testing checklist at the sentence level and lexical synonym criteria at the word level. Our combined method improves the model's robustness under perturbation testing, enabling it to consistently outperform the pre-trained baseline.
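To make the word-level component concrete, here is a minimal sketch of lexical synonym augmentation using NLTK's WordNet. The library choice, the `replace_prob` parameter, and the substitution rule are our assumptions for illustration; the paper's exact synonym criteria may differ.

```python
import random

import nltk
from nltk.corpus import wordnet

# Download WordNet data on first use.
nltk.download("wordnet", quiet=True)


def synonym_augment(sentence: str, replace_prob: float = 0.15) -> str:
    """Return a copy of `sentence` with some words swapped for WordNet synonyms.

    Illustrative sketch of word-level synonym augmentation; a real pipeline
    would likely restrict substitutions by part of speech and context.
    """
    augmented = []
    for word in sentence.split():
        synsets = wordnet.synsets(word)
        if synsets and random.random() < replace_prob:
            # Collect single-word lemmas that differ from the original word.
            lemmas = {
                lemma.name()
                for syn in synsets
                for lemma in syn.lemmas()
                if "_" not in lemma.name() and lemma.name().lower() != word.lower()
            }
            if lemmas:
                word = random.choice(sorted(lemmas))
        augmented.append(word)
    return " ".join(augmented)


# Example: perturb an SNLI-style premise while preserving its label.
premise = "A man is playing a guitar on the street"
print(synonym_augment(premise))
```

Because synonym substitution is (approximately) meaning-preserving, each augmented premise/hypothesis pair can keep its original entailment label, which is what makes this usable as training-time augmentation.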