Logical reasoning is needed in a wide range of NLP tasks. Can a BERT model be trained end-to-end to solve logical reasoning problems presented in natural language? We attempt to answer this question in a confined problem space where there exists a set of parameters that perfectly simulates logical reasoning. We make observations that seem to contradict each other: BERT attains near-perfect accuracy on in-distribution test examples while failing to generalize to other data distributions over the exact same problem space. Our study provides an explanation for this paradox: instead of learning to emulate the correct reasoning function, BERT has in fact learned statistical features that inherently exist in logical reasoning problems. We also show that it is infeasible to jointly remove statistical features from data, illustrating the difficulty of learning to reason in general. Our result naturally extends to other neural models and unveils the fundamental difference between learning to reason and learning to achieve high performance on NLP benchmarks using statistical features.