Large Language Models (LLMs) exhibit enhanced capabilities through Chain-of-Thought reasoning. However, the extended reasoning sequences introduce significant GPU memory overhead due to the growing key-value (KV) cache. Existing KV cache compression methods mitigate this memory bottleneck but struggle on long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a Token Importance Recurrence phenomenon: a large proportion of tokens regain high attention after multiple decoding steps, which existing works fail to capture and which may lead to the unpredictable eviction of such periodically critical tokens. To address this, we propose LazyEviction, an observation-window-based lagged eviction framework that retains latent recurring tokens by prioritizing eviction according to tokens' recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces the KV cache by 50%~70% while maintaining comparable accuracy, outperforming existing KV cache compression baselines. Our implementation code can be found at https://github.com/Halo-949/LazyEviction.