Transformer-based large language models (LLMs) have demonstrated remarkable potential across a wide range of practical applications. However, long-context inference remains a significant challenge due to the substantial memory requirements of the key-value (KV) cache, which can grow to several gigabytes as sequence length and batch size increase. In this paper, we present \textbf{PackKV}, a generic and efficient KV cache management framework optimized for long-context generation. PackKV introduces novel lossy compression techniques tailored to the characteristics of KV cache data, featuring a careful co-design of compression algorithms and system architecture. Our approach accommodates the dynamically growing nature of the KV cache while preserving high computational efficiency. Experimental results show that, under the same minimal accuracy drop as state-of-the-art quantization methods, PackKV achieves, on average, a \textbf{153.2}\% higher memory reduction rate for the K cache and \textbf{179.6}\% for the V cache. Furthermore, PackKV delivers high execution throughput, effectively eliminating decompression overhead and accelerating the matrix-vector multiplication operation. Specifically, compared to cuBLAS matrix-vector multiplication kernels, PackKV achieves an average throughput improvement of \textbf{75.7}\% for K and \textbf{171.7}\% for V across A100 and RTX Pro 6000 GPUs, while demanding less GPU memory bandwidth. Code is available at https://github.com/BoJiang03/PackKV.
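To make the memory pressure concrete, the KV cache footprint can be estimated with a back-of-the-envelope formula (the model configuration below is an illustrative assumption, not a setting taken from our experiments):
\[
M_{\mathrm{KV}} = 2 \cdot n_{\mathrm{layers}} \cdot n_{\mathrm{heads}} \cdot d_{\mathrm{head}} \cdot s \cdot b \cdot B,
\]
where the leading factor of 2 accounts for both K and V, $s$ is the sequence length, $b$ the batch size, and $B$ the bytes per element. For a hypothetical Llama-2-7B-style model ($n_{\mathrm{layers}} = 32$, $n_{\mathrm{heads}} = 32$, $d_{\mathrm{head}} = 128$, FP16 so $B = 2$) at $s = 32768$ and $b = 1$, the cache occupies $2 \cdot 32 \cdot 32 \cdot 128 \cdot 32768 \cdot 2$ bytes $= 16$ GiB, matching the multi-gigabyte scale noted above.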
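The throughput gains above come from fusing decompression into the attention matrix-vector products. As a generic illustration of that idea (a minimal sketch, not PackKV's actual compression format or kernel; it assumes plain per-row INT8 quantization with one FP32 scale per row), the kernel below computes $y = Wx$ directly from the compressed operand, so no dequantized copy of $W$ is ever written to memory:
\begin{verbatim}
// Minimal sketch: GEMV y = W x with W stored as per-row INT8 plus an
// FP32 scale per row. Dequantization is fused into the dot product.
#include <cstdint>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void dequant_gemv(const int8_t* Wq, const float* scale,
                             const float* x, float* y, int d) {
    int row = blockIdx.x;                 // one thread block per output row
    float partial = 0.0f;
    for (int j = threadIdx.x; j < d; j += blockDim.x)
        partial += (float)Wq[row * d + j] * x[j];  // dequant fused into MAC
    __shared__ float buf[256];            // block-wide tree reduction
    buf[threadIdx.x] = partial;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = buf[0] * scale[row];  // one scale per row
}

int main() {
    const int n = 4, d = 512;
    std::vector<int8_t> hW(n * d, 1);     // all-ones rows: dot product = d
    std::vector<float> hs(n, 0.5f), hx(d, 1.0f), hy(n);
    int8_t* dW; float *ds, *dx, *dy;
    cudaMalloc(&dW, n * d);
    cudaMalloc(&ds, n * sizeof(float));
    cudaMalloc(&dx, d * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dW, hW.data(), n * d, cudaMemcpyHostToDevice);
    cudaMemcpy(ds, hs.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx.data(), d * sizeof(float), cudaMemcpyHostToDevice);
    dequant_gemv<<<n, 256>>>(dW, ds, dx, dy, d);
    cudaMemcpy(hy.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %.1f (expected %.1f)\n", hy[0], 0.5f * d);  // 256.0
    return 0;
}
\end{verbatim}
Because the INT8 operand halves the bytes read per multiply-accumulate relative to FP16, such a fused kernel can outrun an uncompressed GEMV on a bandwidth-bound GPU, which is the effect the cuBLAS comparison above measures.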