Device-memory management is a key bottleneck for serving large language models (LLMs) on accelerators whose memory has poor small-granularity random-access bandwidth (e.g., LPDDR5-class). Existing approaches either statically pre-allocate a worst-case KV cache per request, wasting substantial device memory, or rely on fine-grained paging that assumes high random-access tolerance and is therefore ill-suited to LPDDR-style systems. We present ODMA, an on-demand memory allocation framework for LLM serving on random-access-constrained device memory (RACM) platforms such as LPDDR5-based Cambricon MLUs. ODMA builds on generation-length prediction while addressing distribution drift and heavy-tailed request lengths via dynamic bucket partitioning and a large-bucket safeguard: bucket boundaries are periodically re-learned from online histograms, and high-uncertainty or overflowing requests fall back to a reserved large bucket for robustness. On Alpaca and Google-NQ, ODMA improves S3's predictor accuracy from 98.60% to 99.55% and from 82.68% to 93.36%, respectively. Serving DeepSeek-R1-Distill-Qwen-7B on four Cambricon MLU370-X4 accelerators, ODMA increases device-memory utilization from 55.05% to 72.45% on Alpaca and from 42.54% to 61.79% on Google-NQ, and boosts throughput by 23% and 27%, respectively, over a static pre-allocation baseline. These results show that predictor-driven, hardware-aware allocation can unlock efficient LLM serving on RACM accelerators without hardware changes, complementing paging-centric designs tailored to HBM systems.
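To make the bucket mechanism above concrete, the following is a minimal sketch, not the paper's implementation: it illustrates periodically re-learning bucket boundaries from an online histogram of observed generation lengths and routing high-uncertainty or over-boundary requests to a reserved large bucket. All names and parameters (BucketAllocator, relearn_every, the quantile-based boundary update) are hypothetical assumptions for illustration.

```python
# Hypothetical sketch of dynamic bucket partitioning with a large-bucket
# safeguard, as described in the abstract. Not the authors' code.
import numpy as np

class BucketAllocator:
    def __init__(self, num_buckets=4, max_len=4096, relearn_every=1000):
        self.num_buckets = num_buckets      # small buckets; one large bucket is reserved
        self.max_len = max_len              # capacity of the reserved large bucket
        self.relearn_every = relearn_every  # requests between boundary re-learning
        self.observed = []                  # online record of true generation lengths
        # Initial boundaries: evenly spaced up to max_len (assumption).
        self.boundaries = np.linspace(
            max_len // num_buckets, max_len, num_buckets).astype(int)

    def record(self, true_len):
        """Feed an observed generation length into the online histogram."""
        self.observed.append(true_len)
        if len(self.observed) % self.relearn_every == 0:
            self._relearn()

    def _relearn(self):
        """Re-learn boundaries as quantiles of the recent length distribution."""
        recent = np.array(self.observed[-self.relearn_every:])
        qs = np.linspace(0.0, 1.0, self.num_buckets + 1)[1:]
        # Enforce monotone boundaries after the quantile update.
        self.boundaries = np.maximum.accumulate(
            np.quantile(recent, qs).astype(int))

    def allocate(self, predicted_len, high_uncertainty=False):
        """Return the KV-cache capacity to reserve for one request."""
        if high_uncertainty or predicted_len > self.boundaries[-1]:
            return self.max_len             # safeguard: reserved large bucket
        idx = int(np.searchsorted(self.boundaries, predicted_len))
        return int(self.boundaries[idx])    # smallest bucket covering the prediction
```

Under these assumptions, a serving loop would call allocate() with the predictor's output before admitting a request, then record() with the true length after completion, so the boundaries drift with the workload rather than being fixed offline.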