Two of the most fundamental challenges in Natural Language Understanding (NLU) at present are: (a) how to establish whether deep learning-based models score highly on NLU benchmarks for the 'right' reasons; and (b) to understand what those reasons would even be. We investigate the behavior of reading comprehension models with respect to two linguistic 'skills': coreference resolution and comparison. We propose a definition for the reasoning steps expected from a system that would be 'reading slowly', and compare that with the behavior of five models of the BERT family of various sizes, observed through saliency scores and counterfactual explanations. We find that for comparison (but not coreference) the systems based on larger encoders are more likely to rely on the 'right' information, but even they struggle with generalization, suggesting that they still learn specific lexical patterns rather than the general principles of comparison.