Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. We search for the layer and head configuration sufficient to solve these tasks, then probe for signs of systematic processing in latent representations and attention patterns. We show that two-layer transformers learn reliable solutions to multi-level problems, develop signs of task decomposition, and encode input items in a way that encourages the exploitation of shared computation across related tasks. These results provide key insights into how attention layers support structured computation both within a task and across multiple tasks.
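To make the label-based positional scheme concrete, below is a minimal sketch (not the authors' implementation) of one plausible realization in PyTorch: instead of adding embeddings for fixed positions 0, 1, 2, ..., each sequence is paired with a sorted random subset of labels drawn from a label vocabulary larger than any training length, so relative order is preserved while absolute label values vary across examples. The class name, label-vocabulary size, and sorting choice are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class RandomLabelEncoding(nn.Module):
    """Illustrative label-based alternative to standard positional encoding."""

    def __init__(self, d_model: int, num_labels: int = 256):
        super().__init__()
        # One learnable embedding per possible label; num_labels is chosen
        # larger than any sequence length seen in training so longer test
        # sequences can still be labeled (assumed design choice).
        self.label_embed = nn.Embedding(num_labels, d_model)
        self.num_labels = num_labels

    def forward(self, item_embeddings: torch.Tensor) -> torch.Tensor:
        # item_embeddings: (batch, seq_len, d_model)
        batch, seq_len, _ = item_embeddings.shape
        # Sample seq_len distinct labels per sequence and sort them, so the
        # labels carry order information but their specific values are arbitrary.
        labels = torch.stack([
            torch.sort(
                torch.randperm(self.num_labels, device=item_embeddings.device)[:seq_len]
            ).values
            for _ in range(batch)
        ])
        return item_embeddings + self.label_embed(labels)
```

Because the labels are resampled for every sequence, the model is discouraged from tying its solution to absolute positions, which is one way such a scheme could support generalization to sequences longer than those seen in training.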