Many modern workloads, such as neural network inference and graph processing, are fundamentally memory-bound. For such workloads, data movement between memory and CPU cores imposes a significant overhead in both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of memory access. Fundamentally addressing this data movement bottleneck requires a paradigm in which the memory system takes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new technologies that integrate memory with a logic layer, where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip. This paper presents four key takeaways about the UPMEM PIM architecture, which stem from the first comprehensive analysis of this architecture. Further insights on the suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems are available in arXiv:2105.03814.
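The claim that low data reuse cannot amortize the cost of memory access can be illustrated with a roofline-style bound. The following is a minimal sketch and is not taken from the paper: the function names and the machine parameters (peak throughput and memory bandwidth) are hypothetical values chosen only for illustration.

```python
def attainable_gops(peak_gops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Roofline bound: performance is capped by either the compute roof
    (peak_gops) or the memory roof (bandwidth * operational intensity)."""
    return min(peak_gops, bandwidth_gbs * ops_per_byte)


def is_memory_bound(peak_gops: float, bandwidth_gbs: float, ops_per_byte: float) -> bool:
    """A kernel is memory-bound when the memory roof lies below the compute roof."""
    return bandwidth_gbs * ops_per_byte < peak_gops


# Hypothetical processor-centric machine (assumed numbers, not measurements).
PEAK_GOPS = 1000.0      # peak compute throughput, in Gop/s
BANDWIDTH_GBS = 100.0   # memory bus bandwidth, in GB/s

# Element-wise addition of 4-byte values: one operation per 12 bytes moved
# (two 4-byte loads and one 4-byte store), so there is almost no data reuse.
vector_add_intensity = 1 / 12

if __name__ == "__main__":
    bound = attainable_gops(PEAK_GOPS, BANDWIDTH_GBS, vector_add_intensity)
    print(f"memory-bound: {is_memory_bound(PEAK_GOPS, BANDWIDTH_GBS, vector_add_intensity)}")
    print(f"attainable: {bound:.1f} Gop/s out of a {PEAK_GOPS:.0f} Gop/s peak")
```

Under these assumed numbers, the low-reuse kernel is limited by the memory bus to a small fraction of peak compute throughput, which is the bottleneck that PIM aims to relieve by placing processing elements next to the memory arrays.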