We propose a synthetic task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the transformer architecture learns this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain length at training and test time), as well as architectural variants such as weight-tied layers or adding convolutional components. We study how trained models eventually succeed at the task, and in particular we are able to interpret (to some extent) some of the attention heads as well as how information flows through the network. Based on these observations we hypothesize that pretraining helps here merely by providing a smart initialization rather than by supplying deep knowledge stored in the network. We also observe that in certain data regimes the trained transformer finds "shortcut" solutions to follow the chain of reasoning, which impedes the model's ability to generalize to simple variants of the main task, and we find that one can prevent such shortcuts with an appropriate architectural modification or careful data preparation. Motivated by our findings, we begin to explore the task of learning to execute C programs, where a convolutional modification to transformers, namely adding convolutional structures in the key/query/value maps, shows an encouraging edge.
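To make the task concrete, the following is a minimal sketch of a LEGO-style example generator, assuming the group is Z_2 (signs): each clause assigns a variable a signed copy of its predecessor in a chain rooted at a constant, the clauses are shuffled, and the model must resolve the value of every variable. The function name and surface formatting are illustrative assumptions, not taken from the paper's code.

```python
# Hypothetical sketch of a LEGO-style training example over the group Z_2.
import random
import string

def make_lego_example(chain_length=6, seed=None):
    rng = random.Random(seed)
    variables = rng.sample(string.ascii_lowercase, chain_length)
    signs = [rng.choice([+1, -1]) for _ in range(chain_length)]

    # Build the chain: the first variable is a signed constant,
    # each later variable is a signed copy of its predecessor.
    clauses = [f"{variables[0]} = {'+' if signs[0] > 0 else '-'} 1"]
    for i in range(1, chain_length):
        sign = '+' if signs[i] > 0 else '-'
        clauses.append(f"{variables[i]} = {sign} {variables[i - 1]}")

    # Ground-truth value of each variable is the running product of signs.
    values, running = {}, 1
    for var, s in zip(variables, signs):
        running *= s
        values[var] = running

    # Clauses are presented in random order, so resolving them requires
    # following the chain of reasoning rather than reading left to right.
    rng.shuffle(clauses)
    return "; ".join(clauses), values

if __name__ == "__main__":
    sentence, answers = make_lego_example(chain_length=5, seed=0)
    print(sentence)   # e.g. "c = - b; a = + 1; b = - a; ..."
    print(answers)    # each variable mapped to +1 or -1
```

Note that shuffling the clauses is what separates genuine chain-following from the positional "shortcut" solutions discussed above: a model cannot simply resolve variables in reading order.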
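The convolutional modification to the key/query/value maps can be sketched as follows, under the assumption that the usual linear projections are replaced by 1D convolutions over the sequence, so that each projection mixes a small local window of tokens before attention is computed. This is a minimal PyTorch illustration with assumed layer shapes and kernel size, not the authors' implementation.

```python
# Minimal sketch (assumed architecture, not the paper's exact code) of
# self-attention whose query/key/value maps are 1D convolutions over the
# sequence instead of pointwise linear projections.
import torch
import torch.nn as nn

class ConvQKVAttention(nn.Module):
    def __init__(self, dim, num_heads=4, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # keep sequence length unchanged
        # One Conv1d per projection; Conv1d is channels-first, so we
        # transpose in forward().
        self.to_q = nn.Conv1d(dim, dim, kernel_size, padding=padding)
        self.to_k = nn.Conv1d(dim, dim, kernel_size, padding=padding)
        self.to_v = nn.Conv1d(dim, dim, kernel_size, padding=padding)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, dim) -> Conv1d expects (batch, dim, seq_len)
        xt = x.transpose(1, 2)
        q = self.to_q(xt).transpose(1, 2)
        k = self.to_k(xt).transpose(1, 2)
        v = self.to_v(xt).transpose(1, 2)
        out, _ = self.attn(q, k, v, need_weights=False)
        return out
```

With kernel_size=1 this reduces to standard attention; a kernel size greater than 1 lets each head key on short local patterns (such as a clause and its neighbors), which is one plausible reading of why the modification helps on program-like inputs.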