We analyze two Natural Language Inference data sets with respect to their linguistic features. The goal is to identify the syntactic and semantic properties that are particularly hard for a machine learning model to comprehend. To this end, we also investigate the differences between a crowd-sourced, machine-translated data set (SNLI) and a collection of text pairs from internet sources. Our main findings are that the model has difficulty recognizing the semantic importance of prepositions and verbs, which emphasizes the importance of linguistically aware pre-training tasks. Furthermore, it often fails to comprehend antonyms and homonyms, especially when their meaning depends on the context. Incomplete sentences pose another problem, as do longer paragraphs and rare words or phrases. The study shows that automated language understanding requires a more informed approach that uses as much external knowledge as possible throughout the training process.