Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The representation at a given layer can only access representations from lower layers, rather than the higher-level representations that are already available. In this work, we propose the Feedback Transformer architecture, which exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
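To make the mechanism concrete, the sketch below is a minimal, illustrative PyTorch reconstruction of the feedback-memory idea described above, not the authors' released code: the class names (`FeedbackBlock`, `FeedbackTransformer`), the single-head attention, the layer sizes, and the exact way layer outputs are pooled into one memory vector per timestep are simplifying assumptions chosen for brevity.

```python
# Hedged sketch of a feedback-style Transformer: each timestep writes a single
# memory vector that mixes ALL of its layer outputs, and every layer of later
# timesteps attends over those vectors. Illustrative only; details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedbackBlock(nn.Module):
    """One layer: single-head attention over the shared feedback memory, then an MLP."""

    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, memory):
        # x: (batch, d_model) for the current timestep
        # memory: (batch, t, d_model), one pooled vector per past timestep
        q = self.q(x).unsqueeze(1)                          # (batch, 1, d)
        k, v = self.kv(memory).chunk(2, dim=-1)             # (batch, t, d) each
        attn = F.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        x = self.norm1(x + (attn @ v).squeeze(1))           # read from the past
        return self.norm2(x + self.mlp(x))


class FeedbackTransformer(nn.Module):
    """Processes tokens strictly one step at a time; low layers of the current
    step therefore see high-level representations of earlier steps."""

    def __init__(self, vocab_size, d_model=64, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([FeedbackBlock(d_model) for _ in range(n_layers)])
        # learned softmax weights that pool all depths into one memory vector
        self.layer_weights = nn.Parameter(torch.zeros(n_layers + 1))
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) of token ids
        batch, seq_len = tokens.shape
        memory, logits = [], []
        for t in range(seq_len):
            x = self.embed(tokens[:, t])                    # (batch, d_model)
            mem = (torch.stack(memory, dim=1) if memory
                   else x.new_zeros(batch, 1, x.size(-1)))  # empty past at t = 0
            states = [x]
            for layer in self.layers:
                x = layer(x, mem)                           # every depth reads the same memory
                states.append(x)
            w = F.softmax(self.layer_weights, dim=0)        # (n_layers + 1,)
            memory.append((w[:, None, None] * torch.stack(states)).sum(dim=0))
            logits.append(self.out(x))
        return torch.stack(logits, dim=1)                   # (batch, seq_len, vocab_size)
```

The contrast with a standard Transformer is visible in the inner loop: tokens must be processed sequentially, trading away training-time parallelism, but in exchange every layer at step t attends to per-step memory vectors that already mix the highest-level representations of all earlier steps.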