Long-context LLM inference is bottlenecked by the quadratic complexity of attention and the linear growth of the KV cache. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to persistent memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV, a lightweight mechanism that learns to predict a token's utility before it enters the cache. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, Write-Gated KV reduces memory usage by 46-57% and delivers 3.03-3.45$\times$ prefill and 1.89-2.56$\times$ decode speedups on Llama models with negligible accuracy loss, all while remaining compatible with FlashAttention and paged-KV systems. These results demonstrate that learning what to write is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV.
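To make the admission idea concrete, the sketch below shows one way a write gate could decide, per token, whether a KV pair is written to the persistent global cache, while a sliding local cache always retains recent tokens. It is a minimal illustration based on the description above, not the paper's implementation; the names `gate_proj`, `tau`, and `window` are assumptions.

```python
import torch

def write_gated_kv_step(h, k_new, v_new, global_k, global_v, local_k, local_v,
                        gate_proj, tau=0.5, window=128):
    """Admit a new token's KV pair to the global cache only if its predicted
    utility clears a threshold; the sliding local cache always keeps it.
    Shapes (assumed): h, k_new, v_new are (1, d); caches are (T, d)."""
    # Predict the incoming token's utility from its hidden state (assumed gate).
    utility = torch.sigmoid(gate_proj(h))  # (1, 1)

    # Local cache: always admit, then truncate to the sliding window.
    local_k = torch.cat([local_k, k_new], dim=0)[-window:]
    local_v = torch.cat([local_v, v_new], dim=0)[-window:]

    # Global cache: write only high-utility states, filtering the rest
    # out before they ever occupy persistent memory.
    if utility.item() >= tau:
        global_k = torch.cat([global_k, k_new], dim=0)
        global_v = torch.cat([global_v, v_new], dim=0)

    return global_k, global_v, local_k, local_v
```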