LLM serving systems process heterogeneous query workloads whose categories exhibit very different characteristics. Code queries cluster densely in embedding space, while conversational queries are sparsely distributed. Content staleness ranges from minutes (stock data) to months (code patterns). Query repetition patterns range from power-law (code) to uniform (conversation), producing a long-tailed distribution of cache hit rates: high-repetition categories achieve 40-60% hit rates, while low-repetition or volatile categories achieve only 5-15%. Vector-database-backed caches must exclude this long tail because the remote search cost (30ms per query) requires a 15-20% hit rate to break even, leaving 20-30% of production traffic uncached. Uniform cache policies compound the problem: fixed similarity thresholds cause false positives in dense regions and miss valid paraphrases in sparse ones, while fixed TTLs either waste memory or serve stale data. This paper presents category-aware semantic caching, in which similarity thresholds, TTLs, and quotas vary by query category. A hybrid architecture separates in-memory HNSW search from external document storage, reducing the miss cost from 30ms to 2ms. This reduction makes low-hit-rate categories economically viable (break-even at a 3-5% hit rate versus 15-20%), enabling cache coverage across the entire workload distribution. Adaptive load-based policies extend the framework to respond to downstream model load, dynamically adjusting thresholds and TTLs to reduce traffic to overloaded models by 9-17% in theoretical projections.
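To make the break-even economics explicit, consider a minimal first-order cost model (our sketch; the paper's full accounting presumably includes memory and maintenance terms as well). If every query pays a lookup overhead of $c_{\text{lookup}}$ and each hit saves $c_{\text{saved}}$ of backend latency, then caching a category with hit rate $h$ pays off when

$$h \cdot c_{\text{saved}} \;\ge\; c_{\text{lookup}} \quad\Longleftrightarrow\quad h \;\ge\; h^{*} = \frac{c_{\text{lookup}}}{c_{\text{saved}}}.$$

With $c_{\text{lookup}} = 30$ms and an illustrative saving of 150-200ms per hit, $h^{*} \approx 15\text{-}20\%$; cutting the lookup to 2ms lowers $h^{*}$ roughly an order of magnitude, consistent with the 3-5% break-even figure (the exact value depends on the cost terms omitted from this simplified model).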
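A minimal sketch of the lookup path this design implies, assuming hnswlib as the in-memory HNSW index and a generic external key-value store for documents; the class, policy names, and threshold/TTL/quota values (SemanticCache, CategoryPolicy, POLICIES) are illustrative placeholders, not the paper's implementation:

```python
# Hypothetical sketch: in-memory HNSW search, external document storage.
import time
from dataclasses import dataclass

import hnswlib
import numpy as np

@dataclass
class CategoryPolicy:
    sim_threshold: float  # cosine similarity required to count as a hit
    ttl_s: float          # seconds before a cached entry is considered stale
    quota: int            # max entries this category may hold

# Illustrative values: dense, stable categories (code) tolerate long TTLs;
# volatile categories (finance) need strict thresholds and short TTLs.
POLICIES = {
    "code":    CategoryPolicy(sim_threshold=0.92, ttl_s=30 * 24 * 3600, quota=200_000),
    "finance": CategoryPolicy(sim_threshold=0.97, ttl_s=60, quota=10_000),
    "chat":    CategoryPolicy(sim_threshold=0.85, ttl_s=24 * 3600, quota=50_000),
}

class SemanticCache:
    """HNSW index held in memory for search; documents live in an external store."""

    def __init__(self, dim: int, doc_store, max_elements: int = 1_000_000):
        self.index = hnswlib.Index(space="cosine", dim=dim)
        self.index.init_index(max_elements=max_elements, ef_construction=200, M=16)
        self.doc_store = doc_store  # external KV store (e.g. Redis/S3), duck-typed here
        self.meta = {}              # id -> (category, inserted_at)
        self.next_id = 0

    def lookup(self, embedding: np.ndarray, category: str):
        policy = POLICIES[category]
        if self.index.get_current_count() == 0:
            return None
        labels, distances = self.index.knn_query(embedding, k=1)
        sim = 1.0 - float(distances[0][0])      # cosine distance -> similarity
        hit_id = int(labels[0][0])
        cat, inserted_at = self.meta.get(hit_id, (None, 0.0))
        fresh = (time.time() - inserted_at) < policy.ttl_s
        if cat == category and fresh and sim >= policy.sim_threshold:
            return self.doc_store.get(hit_id)   # one remote fetch, on a hit only
        return None                             # a miss pays only the in-memory search

    def insert(self, embedding: np.ndarray, category: str, document) -> None:
        # Per-category quota enforcement and eviction omitted for brevity.
        self.index.add_items(embedding.reshape(1, -1), [self.next_id])
        self.meta[self.next_id] = (category, time.time())
        self.doc_store.put(self.next_id, document)
        self.next_id += 1
```

The key design point the sketch encodes is that a miss only ever pays the in-memory `knn_query` (the ~2ms figure); the external document store is touched exactly once, and only on a confirmed hit.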
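The load-adaptive extension can be sketched in the same terms; the 80% trigger point and scaling constants below are hypothetical placeholders for whatever schedule the paper derives:

```python
# Hypothetical sketch: relax a category's policy as downstream load rises,
# so the cache absorbs more traffic that would otherwise reach the model.
def adjust_policy(base: CategoryPolicy, load: float) -> CategoryPolicy:
    """load in [0, 1]; above 0.8 we trade match strictness and freshness for shed traffic."""
    overload = max(0.0, (load - 0.8) / 0.2)  # 0 at 80% load, 1 at 100%
    return CategoryPolicy(
        sim_threshold=base.sim_threshold - 0.03 * overload,  # accept looser matches
        ttl_s=base.ttl_s * (1.0 + 2.0 * overload),           # keep entries alive longer
        quota=base.quota,
    )
```

Both adjustments work in the same direction: a lower similarity threshold and a longer TTL each convert would-be misses into hits, which is the mechanism behind the projected 9-17% reduction in traffic to overloaded models.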