Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is the task of predicting the entailment relation between a pair of sentences (a premise and a hypothesis). The task has been described as a valuable testing ground for the development of semantic representations, and it is a key component of natural language understanding evaluation benchmarks. Models that understand entailment should encode both the premise and the hypothesis. However, experiments by Poliak et al., based on a comparison across 10 datasets, revealed a strong preference of these models for patterns observed only in the hypothesis. Their results indicated that statistical irregularities in the hypotheses bias models into performing competitively with the state of the art. While recast datasets enable large-scale generation of NLI instances with minimal human intervention, the papers that introduce them do not provide a fine-grained analysis of the statistical patterns that can bias NLI models. In this work, we analyze hypothesis-only models trained on one of the recast datasets from Poliak et al., searching for word-level patterns. Our results indicate the existence of potential lexical biases that could contribute to inflating model performance.
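As a minimal sketch of the kind of word-level analysis discussed above (not the authors' exact procedure), one common way to surface lexical bias is to estimate how strongly individual hypothesis words predict the gold label. The toy `examples` list below is purely hypothetical illustration data:

```python
# Minimal sketch: label distributions conditioned on hypothesis words,
# one common way to look for lexical bias in an NLI dataset.
# The toy `examples` list is hypothetical; real recast datasets are far larger.
from collections import Counter, defaultdict

examples = [  # (hypothesis, label) pairs
    ("a man is sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("a person is outdoors", "entailment"),
    ("someone is moving", "entailment"),
    ("a woman is cooking dinner", "neutral"),
]

label_counts = Counter(label for _, label in examples)
word_label_counts = defaultdict(Counter)
for hypothesis, label in examples:
    for word in set(hypothesis.lower().split()):
        word_label_counts[word][label] += 1

# p(label | word): how strongly a hypothesis word predicts the gold label.
# Words whose conditional distribution deviates sharply from the overall
# label prior are candidate lexical biases (e.g., negation words skewing
# toward "contradiction").
for word, counts in sorted(word_label_counts.items()):
    total = sum(counts.values())
    dist = {lbl: counts[lbl] / total for lbl in counts}
    print(f"{word:10s} n={total}  " +
          "  ".join(f"p({lbl}|w)={p:.2f}" for lbl, p in dist.items()))
```

In practice such counts would be computed over the full training split and compared against the label prior (e.g., via pointwise mutual information) to rank the most strongly biased words.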