Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict token utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45$\times$ prefill and 1.89-2.56$\times$ decode speedups on Llama models with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV.
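The sketch below illustrates the admission idea summarized above: score each token's utility before its key/value pair is written to persistent memory, admit only high-scoring tokens to a compact global cache, and keep recent tokens in a sliding local cache. This is a minimal illustration, not the paper's implementation; the gate design (a linear layer on the hidden state), the `keep_ratio` budget, and the `window` size are all illustrative assumptions.

```python
# Minimal sketch of write-gated KV admission -- NOT the paper's implementation.
# Assumes a per-token scalar gate (linear layer on the hidden state), a fixed
# keep ratio for the global cache, and a sliding local window; gate_proj,
# keep_ratio, and window are hypothetical names.
import torch
import torch.nn as nn


class WriteGatedKVCache(nn.Module):
    def __init__(self, hidden_size: int, keep_ratio: float = 0.5, window: int = 128):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, 1)  # learned utility predictor
        self.keep_ratio = keep_ratio                # fraction admitted to the global cache
        self.window = window                        # size of the sliding local cache

    def forward(self, hidden, keys, values):
        # hidden, keys, values: [batch, seq_len, hidden_size]
        seq_len = hidden.size(1)
        if seq_len <= self.window:
            # Everything still fits in the local window; nothing to gate yet.
            return keys, values

        local_k, local_v = keys[:, -self.window:], values[:, -self.window:]
        prefix = seq_len - self.window

        # Score every prefix token *before* it is written to persistent memory.
        scores = self.gate_proj(hidden[:, :prefix]).squeeze(-1)        # [batch, prefix]
        k_keep = max(1, int(prefix * self.keep_ratio))
        idx = scores.topk(k_keep, dim=-1).indices.sort(dim=-1).values  # keep temporal order

        gather = idx.unsqueeze(-1).expand(-1, -1, keys.size(-1))
        global_k = keys[:, :prefix].gather(1, gather)
        global_v = values[:, :prefix].gather(1, gather)

        # Attention later reads only the compact global cache plus the local window.
        return torch.cat([global_k, local_k], dim=1), torch.cat([global_v, local_v], dim=1)
```

In this toy version the budget is enforced by a top-k over gate scores; a threshold-based gate trained end-to-end would serve the same purpose and keeps the admission decision independent of future tokens, which is what makes it compatible with streaming prefill and paged-KV layouts.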