We propose a synthetic reasoning task, LEGO (Learning Equality and Group Operations), that encapsulates the problem of following a chain of reasoning, and we study how the Transformer architecture learns this task. We pay special attention to data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset composition (e.g., differing chain lengths at training and test time), as well as architectural variants such as weight-tied layers or added convolutional components. We study how the trained models eventually succeed at the task, and in particular, we manage to interpret some of the attention heads as well as how information flows in the network. Notably, we identify a novel \emph{association} pattern that globally attends only to identical tokens. Based on these observations, we hypothesize that pretraining helps on LEGO because of these structured attention patterns, and we verify this hypothesis experimentally. We also observe that in some data regimes the trained Transformer finds ``shortcut'' solutions for following the chain of reasoning, which impede the model's robustness, and we propose ways to prevent such shortcuts. Motivated by our findings on structured attention patterns, we propose the LEGO attention module, a drop-in replacement for vanilla attention heads. This architectural change significantly reduces FLOPs while maintaining or even \emph{improving} the model's performance at large-scale pretraining.