Transformer-based models have shown their effectiveness across multiple domains and tasks. Self-attention allows combining information from all sequence elements into context-aware representations. However, global and local information has to be stored mostly in the same element-wise representations. Moreover, the length of an input sequence is limited by the quadratic computational complexity of self-attention. In this work, we propose and study a memory-augmented segment-level recurrent Transformer (Recurrent Memory Transformer). Memory allows storing and processing local and global information, as well as passing information between segments of a long sequence with the help of recurrence. We implement a memory mechanism with no changes to the Transformer model by adding special memory tokens to the input or output sequence. The Transformer is then trained to control both memory operations and sequence representation processing. Experimental results show that our model performs on par with Transformer-XL on language modeling for smaller memory sizes and outperforms it on tasks that require longer sequence processing. We also show that adding memory tokens to Transformer-XL improves its performance. This makes the Recurrent Memory Transformer a promising architecture for applications that require learning of long-term dependencies and general-purpose in-memory processing, such as algorithmic tasks and reasoning.
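A minimal sketch of the memory-token idea described above, not the authors' implementation: learnable memory tokens are concatenated with a segment's embeddings, an unmodified Transformer processes them jointly, and the outputs at the memory positions are passed recurrently to the next segment. The PyTorch encoder backbone, module name, and all sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentMemorySketch(nn.Module):
    """Hypothetical sketch: segment-level recurrence via memory tokens."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2, num_mem_tokens=8):
        super().__init__()
        self.num_mem_tokens = num_mem_tokens
        # Learnable initial memory, treated like ordinary input embeddings.
        self.mem_init = nn.Parameter(torch.randn(num_mem_tokens, d_model) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)

    def forward(self, segment_embeds, memory=None):
        # segment_embeds: (batch, seg_len, d_model) embeddings of one segment.
        batch = segment_embeds.size(0)
        if memory is None:
            memory = self.mem_init.unsqueeze(0).expand(batch, -1, -1)
        # Prepend memory tokens; the unchanged Transformer attends over
        # memory and sequence tokens jointly via self-attention.
        x = torch.cat([memory, segment_embeds], dim=1)
        out = self.backbone(x)
        # Outputs at memory positions become the memory for the next segment.
        new_memory = out[:, :self.num_mem_tokens]
        segment_out = out[:, self.num_mem_tokens:]
        return segment_out, new_memory

# Usage over a long sequence split into segments: memory carries
# information across segment boundaries through recurrence.
model = RecurrentMemorySketch()
segments = torch.randn(3, 2, 16, 256)  # 3 segments, batch 2, 16 tokens each
memory = None
for seg in segments:
    seg_out, memory = model(seg, memory)
    memory = memory.detach()  # truncated backprop through time; a simplifying assumption
```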