The emergence of LLMs with million-token contexts exposes the scalability limits of inference systems, where the KVCache dominates memory usage and data transfer overhead. Recent offloading systems migrate the KVCache to CPU memory and incorporate top-k attention to reduce the volume of data transferred from the CPU, further applying system-level optimizations such as on-GPU caching and prefetching to lower transfer overhead. However, they overlook the CPU bottleneck, which manifests in three aspects: (1) substantial overhead of fine-grained dynamic cache management performed on the CPU side, (2) significant transfer overhead from poor PCIe bandwidth utilization caused by heavy gathering operations on the CPU side, and (3) GPU runtime bubbles introduced by coarse-grained CPU-centric synchronization. To address these challenges, we propose CLO, a CPU-light KVCache offloading system built via algorithm-system co-design. CLO features: (1) a coarse-grained, head-wise, approximate on-GPU caching strategy with negligible cache management cost, (2) a seamless combination of data prefetching and on-GPU persistent caching for lower transfer overhead, and (3) a zero-copy transfer engine that fully exploits PCIe bandwidth, together with a GPU-centric synchronization method that eliminates GPU stalls. Evaluation on two widely used LLMs demonstrates that CLO achieves accuracy comparable to state-of-the-art systems while substantially reducing CPU overhead and fully utilizing PCIe bandwidth, improving decoding throughput by 9.3%-66.6%. Our results highlight that algorithm-system co-design is essential for memory-constrained LLM inference on modern GPU platforms. We open-source CLO at https://github.com/CommediaJW/CLO.
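For readers unfamiliar with the offloading pattern the abstract refers to, the following is a minimal, illustrative sketch (not CLO's implementation) of top-k attention over a CPU-resident KVCache: approximate scores are computed on the GPU, only the top-k KV entries are gathered on the CPU and copied over PCIe, and exact attention runs on that small subset. The function name, tensor shapes, and the per-head Python-loop gather are illustrative assumptions; the CPU-side gather step is exactly the kind of operation the abstract identifies as a bottleneck.

```python
import torch

def topk_offloaded_attention(q, key_proxy_gpu, k_cpu, v_cpu, topk=64):
    # Illustrative sketch only, not CLO's API.
    # q:              [H, D]     query for the current decode step (on GPU)
    # key_proxy_gpu:  [H, S, D]  lightweight per-token key proxy kept on GPU
    # k_cpu, v_cpu:   [H, S, D]  full KVCache offloaded to (pinned) CPU memory
    H, S, D = key_proxy_gpu.shape

    # 1. Approximate attention scores on the GPU to pick important tokens.
    approx_scores = torch.einsum("hd,hsd->hs", q, key_proxy_gpu)        # [H, S]
    idx = approx_scores.topk(min(topk, S), dim=-1).indices              # [H, topk]

    # 2. Gather only the selected KV rows on the CPU and copy them over PCIe.
    #    This CPU-side gather is the "heavy gathering" overhead noted above.
    idx_cpu = idx.cpu()
    k_sel = torch.stack([k_cpu[h, idx_cpu[h]] for h in range(H)]).to(q.device, non_blocking=True)
    v_sel = torch.stack([v_cpu[h, idx_cpu[h]] for h in range(H)]).to(q.device, non_blocking=True)

    # 3. Exact attention over the small transferred subset.
    scores = torch.einsum("hd,hkd->hk", q, k_sel) / (D ** 0.5)          # [H, topk]
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hk,hkd->hd", probs, v_sel)                     # [H, D]
```

CLO's contributions target the costs visible in this sketch: the per-step CPU gather and copy (addressed by the zero-copy transfer engine and on-GPU caching) and the GPU idling while it waits for the CPU to finish selection and transfer (addressed by GPU-centric synchronization).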