Transformer-based models have shown their effectiveness across multiple domains and tasks. Self-attention allows combining information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer, RMT). Memory allows the model to store and process local and global information as well as to pass information between segments of a long sequence with the help of recurrence. We implement the memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The model is then trained to control both memory operations and sequence representation processing. Experimental results show that RMT performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We also show that adding memory tokens to Transformer-XL improves its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning of long-term dependencies and general-purpose memory processing, such as algorithmic tasks and reasoning.
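The mechanism described above can be illustrated with a minimal sketch: learnable memory tokens are concatenated with each segment, the joint sequence is processed by an unmodified Transformer, and the updated memory states are carried to the next segment. This is not the authors' implementation; the class and parameter names (e.g. RecurrentMemorySketch, num_mem) are illustrative, a standard PyTorch TransformerEncoder stands in for the backbone, and memory tokens are shown only prepended for simplicity.

```python
import torch
import torch.nn as nn

class RecurrentMemorySketch(nn.Module):
    """Hypothetical sketch of segment-level recurrence with memory tokens."""

    def __init__(self, d_model=128, num_mem=4, nhead=4, num_layers=2):
        super().__init__()
        # Learnable memory tokens, shared initialization for the first segment.
        self.mem_tokens = nn.Parameter(torch.randn(num_mem, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers)
        self.num_mem = num_mem

    def forward(self, segments):
        # segments: list of tensors of shape (batch, seg_len, d_model)
        batch = segments[0].size(0)
        memory = self.mem_tokens.unsqueeze(0).expand(batch, -1, -1)
        outputs = []
        for seg in segments:
            # Concatenate memory tokens with the segment and process them jointly,
            # so self-attention mixes memory states and sequence representations.
            x = torch.cat([memory, seg], dim=1)
            y = self.backbone(x)
            # Updated memory is passed to the next segment; keeping the graph
            # here lets gradients flow across segments (BPTT over the recurrence).
            memory = y[:, :self.num_mem]
            outputs.append(y[:, self.num_mem:])
        return outputs, memory
```

In this sketch the backbone itself is unchanged; only the input is augmented with memory tokens, matching the claim that memory is implemented without modifying the Transformer architecture.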