Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions, which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels, whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong, quantifiable evidence of differences in the inductive biases of Transformers and recurrent models, which may help explain Transformers' effective generalization performance despite their relatively limited expressiveness.
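For concreteness, the sensitivity referred to above can be read in its standard Boolean-function sense; a minimal statement of that notion, assuming the usual average-sensitivity formulation, is

$$
s(f, x) = \left|\{\, i \in [n] : f(x) \neq f(x^{\oplus i}) \,\}\right|, \qquad
\bar{s}(f) = \mathbb{E}_{x \sim \mathrm{Unif}(\{0,1\}^n)}\big[s(f, x)\big],
$$

where $f : \{0,1\}^n \to \{0,1\}$ and $x^{\oplus i}$ denotes $x$ with its $i$-th bit flipped. Under this reading, low-sensitivity functions are those whose output rarely changes when a single input bit is flipped, which is why sparse Boolean functions serve as a natural low-sensitivity test bed.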