pLUTo: Enabling Massively Parallel Computation in DRAM via Lookup Tables

João Dinis Ferreira, Gabriel Falcao, Juan Gómez-Luna, Mohammed Alser, Lois Orosa, Mohammad Sadrosadati, Jeremie S. Kim, Geraldo F. Oliveira, Taha Shahroodi, Anant Nori, Onur Mutlu

Data movement between main memory and the processor is a key contributor to the execution time and energy consumption of memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high throughput and efficiency, but supports a limited range of operations. As a result, PuM architectures cannot efficiently perform some complex operations (e.g., multiplication, division, exponentiation) without sizeable increases in chip area and design complexity. To overcome this limitation in DRAM-based PuM architectures, we introduce pLUTo (processing-using-memory with lookup table [LUT] operations), a DRAM-based PuM architecture that leverages the high area density of DRAM to enable the massively parallel storing and querying of LUTs. The use of LUTs enables pLUTo to efficiently execute complex operations in-memory via memory reads (i.e., LUT queries) instead of relying on complex extra logic or performing long sequences of DRAM commands. pLUTo outperforms the optimized CPU and GPU baselines in performance/energy efficiency by an average of 1960$\times$/307$\times$ and 4.2$\times$/4$\times$, respectively, across the evaluated workloads, and by 33$\times$/8$\times$ and 110$\times$/80$\times$, respectively, for the LeNet-5 quantized neural network. pLUTo outperforms a state-of-the-art PiM baseline by 50$\times$/342$\times$ in performance/energy efficiency.
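The central idea of the abstract, replacing a complex operation with a table read, can be illustrated with a minimal software sketch. This is only an analogy under stated assumptions: it runs on a CPU, it does not model DRAM rows or commands, and the names `build_lut`, `lut_query`, `pack`, and `MUL4_LUT` are hypothetical and do not come from the pLUTo paper. It precomputes a 4-bit by 4-bit multiplication as a 256-entry table and then answers a batch of multiplications purely by indexing into it.

```python
# Hypothetical illustration of LUT-based computation; not pLUTo's DRAM mechanism.

def build_lut(fn, input_bits):
    """Precompute fn for every possible input of width input_bits."""
    return [fn(x) for x in range(1 << input_bits)]

def lut_query(lut, inputs):
    """Answer each input with a single table read instead of recomputing fn."""
    return [lut[x] for x in inputs]

def pack(a, b, bits=4):
    """Concatenate two bits-wide operands into one LUT index."""
    return (a << bits) | b

# 4-bit x 4-bit multiplication stored as a 256-entry table.
# The index holds operand a in the high nibble and operand b in the low nibble.
MUL4_LUT = build_lut(lambda idx: (idx >> 4) * (idx & 0xF), 8)

operands = [(3, 5), (15, 15), (0, 9), (7, 2)]
products = lut_query(MUL4_LUT, [pack(a, b) for a, b in operands])
print(products)  # [15, 225, 0, 14]
```

The table grows exponentially with the input width (here 2^8 entries for 8 input bits), which is why an approach of this kind depends on dense storage and is most natural for low-precision operands.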
