Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. By finding the layer and head configuration sufficient to solve the task, then performing ablation experiments and representation analysis, we show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition. They also exploit shared computation across related tasks. These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies in tasks requiring structured behavior.
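To make the label-based position scheme concrete, the following is a minimal sketch, not the authors' implementation: it assumes each item is paired with a label drawn without replacement from a fixed label range larger than any training sequence, with labels assigned to items in increasing order so that relative label order encodes item order; all module and parameter names here are illustrative.

```python
import torch
import torch.nn as nn


class LabelPositionEncoder(nn.Module):
    """Pairs each input item with a randomly drawn order label and embeds both.

    Hypothetical sketch of label-based position encoding: because the label
    range exceeds any training length, sequences longer than those seen in
    training still receive familiar labels, supporting length generalization.
    """

    def __init__(self, vocab_size: int, max_label: int, d_model: int):
        super().__init__()
        self.item_emb = nn.Embedding(vocab_size, d_model)
        self.label_emb = nn.Embedding(max_label, d_model)
        self.max_label = max_label

    def forward(self, items: torch.Tensor) -> torch.Tensor:
        # items: (batch, seq_len) integer token ids
        batch, seq_len = items.shape
        # Draw seq_len distinct labels per sequence from [0, max_label) and
        # sort them so that label order mirrors item order (an assumption
        # about the pairing scheme).
        labels = torch.stack([
            torch.sort(
                torch.randperm(self.max_label, device=items.device)[:seq_len]
            ).values
            for _ in range(batch)
        ])
        # The model sees item identity plus order label rather than an
        # absolute positional encoding.
        return self.item_emb(items) + self.label_emb(labels)


# Toy usage: two sequences of five items from a 20-symbol vocabulary.
enc = LabelPositionEncoder(vocab_size=20, max_label=64, d_model=32)
x = torch.randint(0, 20, (2, 5))
print(enc(x).shape)  # torch.Size([2, 5, 32])
```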