Language models (LMs) underpin emerging mobile and embedded AI applications such as meeting and video summarization and document analysis, which often require processing multiple long-context inputs. Running an LM locally on-device improves privacy, enables offline use, and reduces cost, but long-context inference quickly hits a \emph{memory capacity wall} as the key-value (KV) cache grows linearly with context length and batch size. We present KVSwap, a software framework that breaks this memory wall by offloading the KV cache to non-volatile secondary storage (disk). KVSwap leverages the observation that only a small, dynamically changing subset of KV entries is critical for generation. It stores the full cache on disk, uses compact in-memory metadata to predict which entries to preload, overlaps computation with hardware-aware disk access, and orchestrates read patterns to match storage device characteristics. Our evaluation shows that across representative LMs and storage types, KVSwap delivers higher throughput under tight memory budgets while maintaining generation quality compared with existing KV cache offloading schemes.
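To make the offloading idea concrete, the following is a minimal illustrative sketch (not the paper's implementation): the full KV cache lives on disk, a compact per-entry metadata array stays in memory and is used to predict the top-$k$ entries the next decode step will need, and the corresponding disk reads are issued asynchronously so they overlap with the current step's attention computation. All names here (\texttt{predict\_topk}, \texttt{prefetch}, the metadata layout) are hypothetical assumptions for illustration only.

\begin{verbatim}
# Illustrative sketch only; names and layout are assumptions, not KVSwap's API.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

D, N, K = 64, 4096, 256   # head dim, cached tokens, entries kept in memory

# Full KV cache offloaded to disk; only metadata stays resident in memory.
kv_disk = np.memmap("kv_cache.bin", dtype=np.float16, mode="w+",
                    shape=(N, 2, D))
meta_keys = np.random.randn(N, D).astype(np.float16)  # compact per-entry metadata

def predict_topk(query, k=K):
    """Score cached entries against the query using only in-memory metadata."""
    scores = meta_keys.astype(np.float32) @ query.astype(np.float32)
    return np.argpartition(scores, -k)[-k:]

def prefetch(idx):
    """Read only the predicted entries from disk (sorted to favor sequential I/O)."""
    return np.asarray(kv_disk[np.sort(idx)])

pool = ThreadPoolExecutor(max_workers=1)

def decode_step(query, kv_hot, query_next):
    # Overlap: launch the prefetch for the *next* step in the background ...
    fut = pool.submit(prefetch, predict_topk(query_next))
    # ... while attention for the current step runs on the in-memory subset.
    keys, values = kv_hot[:, 0, :], kv_hot[:, 1, :]
    scores = keys.astype(np.float32) @ query.astype(np.float32)
    weights = np.exp(scores - scores.max())
    out = (weights / weights.sum()) @ values.astype(np.float32)
    return out, fut.result()   # next step's hot KV entries arrive here

# Usage: warm the in-memory subset once, then alternate compute and prefetch.
q0, q1 = np.random.randn(D), np.random.randn(D)
kv_hot = prefetch(predict_topk(q0))
out, kv_hot = decode_step(q0, kv_hot, q1)
\end{verbatim}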