SoCs are now designed with a dedicated AI accelerator segment to accommodate the ever-increasing demands of Deep Learning (DL) applications. With powerful MAC engines for matrix multiplications, these accelerators deliver high compute performance. However, because of limited memory resources (i.e., bandwidth and capacity), they fail to achieve optimum system performance during large-batch training and inference. In this work, we propose a memory system with high on-chip capacity and bandwidth to shift AI accelerators from being memory-bound to achieving system-level peak performance. We develop the memory system with DTCO-enabled, customized SOT-MRAM as large on-chip memory, guided by STCO and detailed characterization of the DL workloads. During training, our workload-aware memory system achieves 8X energy and 9X latency improvement on Computer Vision (CV) benchmarks and 8X energy and 4.5X latency improvement on Natural Language Processing (NLP) benchmarks, while consuming only around 50% of SRAM area at iso-capacity.
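To illustrate the memory-bound behavior the abstract refers to, the sketch below applies a standard roofline-style estimate: attainable throughput is the minimum of the MAC engine's peak rate and the product of memory bandwidth and arithmetic intensity. This is not the paper's methodology, and all numbers (32 TFLOP/s peak, 256 GB/s bandwidth, the FLOP/byte intensities) are hypothetical placeholders chosen only to show when the memory side, rather than the compute side, caps performance.

```python
# Illustrative roofline-style estimate (hypothetical numbers, not measured values):
# shows how limited memory bandwidth can leave a powerful MAC engine memory-bound.

def attainable_tflops(peak_tflops, bandwidth_gbs, intensity_flops_per_byte):
    """Roofline model: performance is capped by compute peak or by bandwidth."""
    memory_roof_tflops = bandwidth_gbs * intensity_flops_per_byte / 1e3  # GFLOP/s -> TFLOP/s
    return min(peak_tflops, memory_roof_tflops)

# Hypothetical accelerator: 32 TFLOP/s MAC array, 256 GB/s on-chip memory bandwidth.
peak, bw = 32.0, 256.0

# A large-batch GEMM streaming big activation/weight tiles may reach only ~20 FLOP/byte,
# while a well-tiled, reuse-heavy GEMM may reach ~200 FLOP/byte.
low_intensity = attainable_tflops(peak, bw, 20.0)    # bandwidth-limited: ~5.1 TFLOP/s
high_intensity = attainable_tflops(peak, bw, 200.0)  # compute-limited: 32 TFLOP/s

print(f"20 FLOP/B:  {low_intensity:.1f} TFLOP/s attainable (memory-bound)")
print(f"200 FLOP/B: {high_intensity:.1f} TFLOP/s attainable (compute-bound)")
```

Under these assumed numbers, raising on-chip bandwidth and capacity (e.g., with a denser on-chip memory) lifts the memory roof and lets low-intensity workloads approach the compute peak, which is the system-level effect the proposed memory system targets.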