The expansion of long-context Large Language Models (LLMs) creates significant memory system challenges. While Processing-in-Memory (PIM) is a promising accelerator, we identify that it suffers from critical inefficiencies when scaled to long contexts: severe channel underutilization, performance-limiting I/O bottlenecks, and massive memory waste from static KV cache management. In this work, we propose PIMphony, a PIM orchestrator that systematically resolves these issues with three co-designed techniques. First, Token-Centric PIM Partitioning (TCP) ensures high channel utilization regardless of batch size. Second, Dynamic PIM Command Scheduling (DCS) mitigates the I/O bottleneck by overlapping data movement and computation. Finally, a Dynamic PIM Access (DPA) controller enables dynamic memory management to eliminate static memory waste. Implemented via an MLIR-based compiler and evaluated on a cycle-accurate simulator, PIMphony significantly improves throughput for long-context LLM inference (up to 72B parameters and 1M context length). Our evaluations show performance boosts of up to 11.3x on PIM-only systems and 8.4x on xPU+PIM systems, enabling more efficient deployment of LLMs in real-world long-context applications.