Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

From arXiv. Invited paper to appear at the Workshop on Computing with Unconventional Technologies (CUT) 2021, https://sites.google.com/umn.edu/cut-2021/home. arXiv admin note: substantial text overlap with arXiv:2105.03814.

Many modern workloads, such as neural network inference and graph processing, are fundamentally memory-bound. For such workloads, data movement between memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new technologies that integrate memory with a logic layer, where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly available real-world PIM architecture. The UPMEM PIM architecture combines traditional DRAM memory arrays with general-purpose in-order cores, called DRAM Processing Units (DPUs), integrated in the same chip. This paper presents key takeaways from the first comprehensive analysis of this architecture. We provide four key takeaways about the UPMEM PIM architecture, which stem from our study. More insights about the suitability of different workloads to the PIM system, programming recommendations for software designers, and suggestions and hints for hardware and architecture designers of future PIM systems are available in arXiv:2105.03814.
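The claim that low data reuse cannot amortize the cost of memory access can be made concrete with the roofline model, a standard performance model that is not taken from this paper. The sketch below is a minimal illustration under assumed parameters: `PEAK_GFLOPS` and `BANDWIDTH_GBS` are hypothetical host values, not measurements of any real CPU or of the UPMEM system, and the helper names are illustrative only.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Operations performed per byte transferred between memory and the cores."""
    return flops / bytes_moved


def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory bandwidth."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical host parameters, chosen only for illustration (not measured values).
PEAK_GFLOPS = 1000.0  # assumed peak compute throughput of the CPU cores
BANDWIDTH_GBS = 50.0  # assumed bandwidth of the narrow memory bus

# Vector add c[i] = a[i] + b[i] on 32-bit floats: 1 FLOP per 12 bytes moved
# (two 4-byte loads plus one 4-byte store), i.e. no data reuse at all.
n = 1 << 20
vec_add_ai = arithmetic_intensity(flops=n, bytes_moved=12 * n)

# Dense m x m matrix multiply on 32-bit floats: 2*m^3 FLOPs over 3*m^2 words,
# assuming each matrix crosses the bus exactly once (ideal caching),
# so data reuse grows with m.
m = 1024
gemm_ai = arithmetic_intensity(flops=2 * m**3, bytes_moved=3 * 4 * m**2)

for name, ai in [("vector add", vec_add_ai), ("matrix multiply", gemm_ai)]:
    perf = attainable_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
    bound = "memory-bound" if perf < PEAK_GFLOPS else "compute-bound"
    print(f"{name}: {ai:.3f} FLOP/byte -> {perf:.1f} GFLOP/s ({bound})")
```

With these assumed numbers, vector add stays far below the compute peak because each byte fetched over the bus is used only once, while matrix multiply reuses every element many times and reaches the compute roof. PIM targets the first kind of kernel by placing processing next to the data rather than by widening the bus.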

翻译：许多现代工作量,如神经网络推断和图表处理,基本上都是记忆性的。对于这些工作量,记忆核心和CPU核心之间的数据流动在潜伏和能量方面都要求大量的间接费用。一个主要原因是,这种通信是通过一个内嵌高和带宽有限、在内存工作量中数据再利用程度低的窄公共汽车进行的,不足以分散内存访问的成本。从根本上解决这一数据移动瓶颈需要一种模式,即记忆系统通过整合处理能力在计算中发挥积极作用。这一模式被称为内处理(PIM)。最近的研究探索了不同形式的PIM结构,其原因是出现了将内存与逻辑层融合起来的新技术,而处理要素可以很容易地放置。过去的工作是在模拟中评价这些结构,或者最好使用简化的硬件原型。相比之下,UPMEM公司设计并制造了第一个向公众公开开放的PIMPIM结构。 UPMEM OFSerma 结构将传统的DRAM 与一般目的核心(PIM)的普通目的阵列,称为DRAMIM系统未来的关键直径分析系统。我们的主要IM-IM结构是Slodal 。