Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. While recent KV cache eviction methods have alleviated the inference burden, they often fail to allocate resources rationally across layers with different attention patterns. In this paper, we introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem." CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates a rational cache size to each layer accordingly, and manages memory constraints in a cascading manner. This approach enables a global view of cache allocation, adaptively distributing resources across diverse attention mechanisms while maintaining memory budgets. CAKE also employs a new eviction indicator that accounts for the shifting importance of tokens over time, addressing a limitation of existing methods that overlook temporal dynamics. Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10x speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. Our code is available at https://github.com/antgroup/cakekv.
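The "cake-slicing" allocation idea can be illustrated with a minimal sketch, stated under assumptions rather than as the actual CAKE implementation: per-layer preference scores (which CAKE derives from spatial and temporal attention dynamics) are turned into proportional slices of a global KV cache budget, and each layer then keeps its highest-scoring tokens. The helper names (`allocate_layer_budgets`, `evict_tokens`, `layer_preferences`, `token_scores`) and the random placeholder scores below are illustrative assumptions, not identifiers or statistics from the released code.

```python
# Minimal sketch of proportional, layer-adaptive KV cache budgeting.
# Not the authors' implementation; preference and token scores are placeholders.
import torch


def allocate_layer_budgets(layer_preferences: torch.Tensor, total_budget: int) -> list[int]:
    """Split `total_budget` cached tokens across layers in proportion to each
    layer's preference score (larger score -> larger slice of the cake)."""
    weights = layer_preferences / layer_preferences.sum()
    budgets = (weights * total_budget).floor().to(torch.int64)
    # Hand out any remainder left by flooring to the layers with the largest weights.
    remainder = total_budget - int(budgets.sum())
    if remainder > 0:
        top = torch.argsort(weights, descending=True)[:remainder]
        budgets[top] += 1
    return budgets.tolist()


def evict_tokens(keys: torch.Tensor, values: torch.Tensor,
                 token_scores: torch.Tensor, budget: int):
    """Keep only the `budget` tokens with the highest importance scores.
    keys/values: [num_tokens, head_dim]; token_scores: [num_tokens]."""
    keep = torch.topk(token_scores, k=min(budget, token_scores.numel())).indices.sort().values
    return keys[keep], values[keep]


if __name__ == "__main__":
    torch.manual_seed(0)
    num_layers, num_tokens, head_dim, total_budget = 4, 64, 8, 96
    # Hypothetical per-layer preference scores (in CAKE these would come from
    # spatial/temporal attention statistics); random placeholders here.
    layer_preferences = torch.rand(num_layers) + 0.1
    budgets = allocate_layer_budgets(layer_preferences, total_budget)
    print("per-layer budgets:", budgets, "sum:", sum(budgets))
    for layer, budget in enumerate(budgets):
        K = torch.randn(num_tokens, head_dim)
        V = torch.randn(num_tokens, head_dim)
        scores = torch.rand(num_tokens)  # placeholder token-importance indicator
        K_kept, V_kept = evict_tokens(K, V, scores, budget)
        print(f"layer {layer}: kept {K_kept.shape[0]} / {num_tokens} tokens")
```

The sketch only shows the budgeting step; it omits CAKE's cascading memory management and its temporal eviction indicator, which the paper describes in detail.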