Natural language inference (NLI) requires models to learn and apply commonsense knowledge. These reasoning abilities are particularly important for explainable NLI systems that generate a natural language explanation in addition to their label prediction. The integration of external knowledge has been shown to improve NLI systems; here we investigate whether it can also improve their explanation capabilities. To this end, we evaluate different sources of external knowledge and measure the performance of our models on in-domain data as well as on special transfer datasets designed to assess fine-grained reasoning capabilities. We find that different sources of knowledge affect reasoning abilities in different ways; for example, implicit knowledge stored in language models can hinder reasoning about numbers and negations. Finally, we conduct the largest and most fine-grained explainable NLI crowdsourcing study to date. It reveals that even large differences in automatic performance scores are not reflected in human ratings of label, explanation, commonsense, or grammar correctness.