SoCs are now designed with a dedicated AI accelerator segment to accommodate the ever-increasing demands of Deep Learning (DL) applications. With powerful MAC engines for matrix multiplications, these accelerators deliver high compute performance. However, because of limited memory resources (i.e., bandwidth and capacity), they fail to achieve optimum system performance during large-batch training and inference. In this work, we propose a memory system with high on-chip capacity and bandwidth to shift AI accelerators from being memory-bound to achieving system-level peak performance. We develop the memory system with DTCO-enabled, customized SOT-MRAM as large on-chip memory, guided by STCO and detailed characterization of the DL workloads. During training, our workload-aware memory system achieves 8X energy and 9X latency improvement on Computer Vision (CV) benchmarks and 8X energy and 4.5X latency improvement on Natural Language Processing (NLP) benchmarks, while consuming only around 50% of SRAM area at iso-capacity.
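To illustrate the memory-bound behavior the abstract refers to, the sketch below applies a standard roofline-style estimate: attainable throughput is the minimum of the MAC engine's peak rate and the product of memory bandwidth and arithmetic intensity. This is not the paper's methodology, and all numbers (32 TFLOP/s peak, 256 GB/s bandwidth, the FLOP/byte intensities) are hypothetical placeholders chosen only to show when the memory side, rather than the compute side, caps performance.

```python
# Illustrative roofline-style estimate (hypothetical numbers, not measured values):
# shows how limited memory bandwidth can leave a powerful MAC engine memory-bound.

def attainable_tflops(peak_tflops, bandwidth_gbs, intensity_flops_per_byte):
    """Roofline model: performance is capped by compute peak or by bandwidth."""
    memory_roof_tflops = bandwidth_gbs * intensity_flops_per_byte / 1e3  # GFLOP/s -> TFLOP/s
    return min(peak_tflops, memory_roof_tflops)

# Hypothetical accelerator: 32 TFLOP/s MAC array, 256 GB/s on-chip memory bandwidth.
peak, bw = 32.0, 256.0

# A large-batch GEMM streaming big activation/weight tiles may reach only ~20 FLOP/byte,
# while a well-tiled, reuse-heavy GEMM may reach ~200 FLOP/byte.
low_intensity = attainable_tflops(peak, bw, 20.0)    # bandwidth-limited: ~5.1 TFLOP/s
high_intensity = attainable_tflops(peak, bw, 200.0)  # compute-limited: 32 TFLOP/s

print(f"20 FLOP/B:  {low_intensity:.1f} TFLOP/s attainable (memory-bound)")
print(f"200 FLOP/B: {high_intensity:.1f} TFLOP/s attainable (compute-bound)")
```

Under these assumed numbers, raising on-chip bandwidth and capacity (e.g., with a denser on-chip memory) lifts the memory roof and lets low-intensity workloads approach the compute peak, which is the system-level effect the proposed memory system targets.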