Data movement between the main memory and the processor is a key contributor to execution time and energy consumption in memory-intensive applications. This data movement bottleneck can be alleviated using Processing-in-Memory (PiM). One category of PiM is Processing-using-Memory (PuM), in which computation takes place inside the memory array by exploiting intrinsic analog properties of the memory device. PuM yields high performance and energy efficiency, but existing PuM techniques support a limited range of operations. As a result, current PuM architectures cannot efficiently perform some complex operations (e.g., multiplication, division, exponentiation) without large increases in chip area and design complexity. To overcome these limitations of existing PuM architectures, we introduce pLUTo (processing-using-memory with lookup table (LUT) operations), a DRAM-based PuM architecture that leverages the high storage density of DRAM to enable the massively parallel storing and querying of lookup tables (LUTs). The key idea of pLUTo is to replace complex operations with low-cost, bulk memory reads (i.e., LUT queries) instead of relying on complex extra logic. We evaluate pLUTo across 11 real-world workloads that showcase the limitations of prior PuM approaches and show that our solution outperforms optimized CPU and GPU baselines by an average of 713$\times$ and 1.2$\times$, respectively, while simultaneously reducing energy consumption by an average of 1855$\times$ and 39.5$\times$. Across these workloads, pLUTo outperforms state-of-the-art PiM architectures by an average of 18.3$\times$. We also show that different versions of pLUTo provide different levels of flexibility and performance at different additional DRAM area overheads (between 10.2% and 23.1%). pLUTo's source code is openly and fully available at https://github.com/CMU-SAFARI/pLUTo.
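As a purely software analogy for the key idea above (replacing a complex operation with bulk LUT queries), the following minimal Python sketch precomputes a 4-bit by 4-bit multiplication table and then answers every multiplication with a single table read. The names `build_lut`, `bulk_query`, `mul4`, and `lut_multiply` are hypothetical and chosen for illustration only. The sketch does not model pLUTo's in-DRAM query mechanism, its parallelism, or its data layout.

```python
def build_lut(func, input_bits):
    """Precompute func for every possible input of the given bit width."""
    return [func(x) for x in range(1 << input_bits)]


def bulk_query(lut, indices):
    """Answer each element with one table read instead of recomputing it."""
    return [lut[i] for i in indices]


def mul4(packed):
    """Multiply two 4-bit operands packed into one 8-bit index (assumed layout)."""
    a = packed >> 4      # upper 4 bits hold the first operand
    b = packed & 0xF     # lower 4 bits hold the second operand
    return a * b


# 256-entry table covering every pair of 4-bit operands.
MUL_LUT = build_lut(mul4, 8)


def lut_multiply(xs, ys):
    """Element-wise 4-bit multiplication using only table lookups."""
    packed = [(a << 4) | b for a, b in zip(xs, ys)]
    return bulk_query(MUL_LUT, packed)


if __name__ == "__main__":
    print(lut_multiply([3, 15, 0], [5, 15, 9]))
```

The table grows exponentially with the operand width (here, 2^8 entries for 8 input bits), which is why a LUT-based approach of this kind depends on dense storage for the tables.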