SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching

As large language models (LLMs) continue to scale, the memory footprint of key-value (KV) caches during inference has become a significant bottleneck. Existing approaches primarily focus on compressing KV caches within a single prompt, or on reusing shared prefixes or frequently occurring text segments across prompts. However, such strategies fall short when prompts are semantically similar but lexically different, a situation that frequently arises in tasks such as multi-document summarization and conversational agents. We propose \textit{SemShareKV}, a KV cache sharing and compression framework that accelerates LLM inference by reusing KV caches across semantically similar prompts. Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. By selectively reusing relevant key-value pairs from a reference prompt's cache, SemShareKV reduces redundant computation while maintaining output quality. Experiments on diverse summarization datasets show up to 6.25$\times$ speedup and 42\% lower GPU memory usage with 5k-token inputs, with negligible quality degradation. These results highlight the potential of semantic-aware cache sharing for efficient LLM inference.
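The fuzzy token matching idea can be illustrated with a minimal sketch. The abstract does not say which LSH family SemShareKV uses, so the code below assumes random-hyperplane (sign) hashing and a brute-force Hamming comparison. The function names (`make_hyperplanes`, `lsh_signature`, `match_tokens`) and all parameters are hypothetical and chosen only for illustration. It omits the RoPE handling and the cache reuse itself.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes in a dim-dimensional space.

    Assumption: random-hyperplane LSH; the source does not name the LSH family.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, planes):
    """Return one bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(p_i * v_i for p_i, v_i in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def hamming(sig_a, sig_b):
    """Count the positions at which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def match_tokens(target_embs, reference_embs, planes):
    """Map each target token to the reference token with the closest signature.

    Returns a list of reference indices, one per target token. Ties are
    resolved in favor of the lowest reference index.
    """
    ref_sigs = [lsh_signature(emb, planes) for emb in reference_embs]
    matches = []
    for emb in target_embs:
        sig = lsh_signature(emb, planes)
        best = min(range(len(ref_sigs)), key=lambda j: hamming(sig, ref_sigs[j]))
        matches.append(best)
    return matches


if __name__ == "__main__":
    planes = make_hyperplanes(dim=4, n_bits=16)
    # Toy embeddings standing in for the token embeddings of two prompts.
    reference = [[1.0, 0.2, 0.0, 0.1], [0.0, 1.0, 0.3, 0.0], [0.1, 0.0, 1.0, 0.4]]
    target = [[0.0, 0.9, 0.35, 0.0], [0.9, 0.25, 0.0, 0.1]]
    print(match_tokens(target, reference, planes))
```

In a pipeline of the kind the abstract describes, the returned indices would select which key-value pairs of the reference prompt's cache are candidates for reuse. A practical implementation would likely bucket signatures rather than compare every pair, which this sketch does not attempt.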
