Transformers, while powerful, suffer from quadratic computational complexity and the ever-growing Key-Value (KV) cache of the attention mechanism. This paper introduces Trellis, a novel Transformer architecture with bounded memory that learns how to compress its key-value memory dynamically at test time. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. To achieve this, it leverages an online gradient descent procedure with a forget gate, enabling the compressed memory to be updated recursively while learning to retain important contextual information from incoming tokens at test time. Extensive experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines. Notably, its performance gains increase as the sequence length grows, highlighting its potential for long-context applications.
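To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of a forget-gated online gradient-descent update for a bounded memory. It assumes, for illustration only, that the compressed memory is a single matrix `M` of shape `(d_k, d_v)`, that writing a token minimizes an L2 reconstruction loss `||M^T k - v||^2`, and that `alpha` (forget gate) and `eta` (step size) are scalars; the paper's actual two-pass mechanism and gating are richer than this.

```python
# Hypothetical sketch of a forget-gated online gradient-descent memory update.
# M, alpha, eta, and the L2 write loss are illustrative assumptions, not the
# paper's exact formulation.
import torch

def memory_update(M, k, v, alpha, eta):
    """One recurrent update of a fixed-size memory M (d_k x d_v).

    M     : current compressed memory, shape (d_k, d_v)
    k, v  : incoming key / value vectors, shapes (d_k,) and (d_v,)
    alpha : forget gate in [0, 1] (1 = keep all past memory)
    eta   : learning rate of the online gradient step
    """
    pred = M.t() @ k                      # current read-out for key k, shape (d_v,)
    grad = torch.outer(k, pred - v)       # gradient of 0.5*||M^T k - v||^2 w.r.t. M
    # Forget-gated online gradient descent: decay old memory, write the new token.
    return alpha * M - eta * grad

# Toy usage: stream tokens into a bounded memory instead of a growing KV cache.
d_k, d_v = 16, 16
M = torch.zeros(d_k, d_v)
for _ in range(100):
    k, v = torch.randn(d_k), torch.randn(d_v)
    M = memory_update(M, k, v, alpha=0.95, eta=0.1)

# Reading replaces attention over the full KV cache with a query against M.
q = torch.randn(d_k)
out = M.t() @ q
```

The point of the sketch is that memory cost stays constant in sequence length: each token updates `M` in place rather than appending to a KV cache.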