The expansion of long-context Large Language Models (LLMs) creates significant memory system challenges. While Processing-in-Memory (PIM) is a promising accelerator, we identify that it suffers from critical inefficiencies when scaled to long contexts: severe channel underutilization, performance-limiting I/O bottlenecks, and massive memory waste from static KV cache management. In this work, we propose PIMphony, a PIM orchestrator that systematically resolves these issues with three co-designed techniques. First, Token-Centric PIM Partitioning (TCP) ensures high channel utilization regardless of batch size. Second, Dynamic PIM Command Scheduling (DCS) mitigates the I/O bottleneck by overlapping data movement and computation. Finally, a Dynamic PIM Access (DPA) controller enables dynamic memory management to eliminate static memory waste. Implemented via an MLIR-based compiler and evaluated on a cycle-accurate simulator, PIMphony significantly improves throughput for long-context LLM inference (up to 72B parameters and 1M context length). Our evaluations show performance boosts of up to 11.3x on PIM-only systems and 8.4x on xPU+PIM systems, enabling more efficient deployment of LLMs in real-world long-context applications.