Attention mechanisms have become a standard tool for sequence modeling tasks, in particular by stacking self-attention layers over the entire input sequence, as in the Transformer architecture. In this work we introduce a novel attention procedure called staircase attention that, unlike self-attention, operates across the sequence (in time), recurrently processing the input one step at a time. A step in the staircase comprises backward tokens (encoding the sequence seen so far) and forward tokens (ingesting a new part of the sequence); an extreme Ladder version with a forward step of zero simply repeats the Transformer on each step of the ladder, sharing the weights. We thus describe a family of such models that can trade off performance and compute by increasing the amount of recurrence through time, the amount of sequential processing via recurrence in depth, or both. Thanks to this recurrence, staircase attention is shown to solve tasks that involve tracking state over time, which conventional Transformers cannot. Further, it is shown to provide improved modeling power for the same model size (number of parameters) compared to self-attentive Transformers on large language modeling and dialogue tasks, yielding significant perplexity gains.
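To make the recurrence concrete, below is a minimal sketch of chunked recurrent processing in this style, assuming PyTorch; the class name StaircaseSketch, the fixed backward-window size, and the use of a standard nn.TransformerEncoder as the shared core are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: each staircase step re-processes carried-over
# "backward" tokens together with a newly ingested "forward" chunk, reusing
# the same Transformer weights at every step.
import torch
import torch.nn as nn

class StaircaseSketch(nn.Module):
    def __init__(self, dim=64, heads=4, layers=2, chunk=16, back=32):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.core = nn.TransformerEncoder(enc_layer, num_layers=layers)  # shared across steps
        self.chunk = chunk  # number of forward tokens ingested per step
        self.back = back    # number of backward tokens carried to the next step

    def forward(self, x):  # x: (batch, seq_len, dim)
        carry = x.new_zeros(x.size(0), 0, x.size(2))  # backward tokens (empty at first step)
        outputs = []
        for start in range(0, x.size(1), self.chunk):
            forward_tokens = x[:, start:start + self.chunk]       # new part of the sequence
            step_input = torch.cat([carry, forward_tokens], dim=1)
            step_output = self.core(step_input)                   # one staircase step
            outputs.append(step_output[:, -forward_tokens.size(1):])
            carry = step_output[:, -self.back:]                   # recurrence in time
        return torch.cat(outputs, dim=1)

# Usage: y = StaircaseSketch()(torch.randn(2, 64, 64))
```

In this reading, re-attending to the carried-over tokens at every step is what gives the model recurrence in time, while shrinking the forward chunk to zero and simply re-applying the shared core corresponds to the Ladder variant's recurrence in depth.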