Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To mitigate this issue, we explore a series of hybrid models that restore direct access to past tokens. We interleave linear attention with token mixers whose time and space complexity is intermediate between that of linear and full attention, including sparse attention with token eviction and query-aware native sparse attention. In particular, we propose a novel learnable token-eviction approach. Combined with sliding-window attention, a lightweight, end-to-end trainable CNN aggregates information from both past and future adjacent tokens to adaptively retain a limited set of critical KV pairs per head, preserving linear attention's constant time and space complexity. We also provide efficient Triton kernels for the sparse attention mechanisms. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
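To make the learnable token-eviction idea concrete, the following is a minimal PyTorch sketch under our own assumptions about the exact architecture: a depthwise 1D convolution whose receptive field spans both past and future neighboring tokens produces a per-head retention score for every token, and only the top-k highest-scoring KV pairs per head are kept for attention. The names `EvictionScorer` and `keep_topk_kv` are illustrative and are not the paper's actual API.

```python
# Hypothetical sketch of per-head learnable token eviction; not the paper's code.
import torch
import torch.nn as nn


class EvictionScorer(nn.Module):
    """Lightweight CNN that assigns a retention score to every token, per head."""

    def __init__(self, num_heads: int, head_dim: int, kernel_size: int = 5):
        super().__init__()
        # Symmetric padding lets each score aggregate both past and future neighbors.
        self.conv = nn.Conv1d(
            in_channels=num_heads * head_dim,
            out_channels=num_heads,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=num_heads,  # one independent scorer per head
        )

    def forward(self, k: torch.Tensor) -> torch.Tensor:
        # k: (batch, seq_len, num_heads, head_dim) -> scores: (batch, num_heads, seq_len)
        b, t, h, d = k.shape
        x = k.permute(0, 2, 3, 1).reshape(b, h * d, t)
        return self.conv(x)


def keep_topk_kv(k: torch.Tensor, v: torch.Tensor, scores: torch.Tensor, budget: int):
    """Retain only the `budget` highest-scoring KV pairs per head."""
    # scores: (batch, num_heads, seq_len); k, v: (batch, seq_len, num_heads, head_dim)
    idx = scores.topk(min(budget, scores.shape[-1]), dim=-1).indices  # (b, h, budget)
    idx = idx.permute(0, 2, 1).unsqueeze(-1).expand(-1, -1, -1, k.shape[-1])
    return k.gather(1, idx), v.gather(1, idx)
```

In the hybrid models described above, attention would then run over the union of these retained KV pairs and the local sliding window, so the per-head KV budget stays fixed regardless of sequence length; the Triton kernels mentioned in the abstract implement the sparse attention over such selected tokens efficiently.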