Second-order training methods can converge much faster than first-order optimizers in DNN training, because they use the inverse of the second-order information (SOI) matrix to find a more accurate descent direction and step size. However, the huge SOI matrices incur significant computational and memory overheads on traditional architectures such as GPUs and CPUs. In contrast, ReRAM-based processing-in-memory (PIM) technology is well suited to second-order training for three reasons: first, PIM computes in memory, which reduces data movement overheads; second, ReRAM crossbars can compute the inverse of the SOI matrix in $O(1)$ time; third, if architected properly, ReRAM crossbars can perform both the matrix inversion and the vector-matrix multiplications that second-order training algorithms rely on. Nevertheless, current ReRAM-based PIM techniques still face a key challenge in accelerating second-order training: existing ReRAM-based matrix inversion circuitry supports only 8-bit-accurate matrix inversion, which is insufficient for second-order training, where matrix inversion must be accurate to at least 16 bits. In this work, we propose a method that achieves high-precision matrix inversion by building on proven 8-bit matrix inversion (INV) circuitry and vector-matrix multiplication (VMM) circuitry. We design \archname{}, a ReRAM-based PIM accelerator architecture for second-order training. Moreover, we propose a software mapping scheme for \archname{} that further improves performance by fusing the VMM and INV crossbars. Experiments show that, on large-scale DNNs, \archname{} achieves an average speedup of 115.8$\times$/11.4$\times$ and an average energy saving of 41.9$\times$/12.8$\times$ compared to a GPU counterpart and PipeLayer, respectively.
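To illustrate how a low-precision inversion unit and a multiplication unit can be combined into a higher-precision result, the following is a minimal sketch of classical iterative refinement. It is an assumption made for illustration, not necessarily the scheme used in \archname{}: the 8-bit INV circuitry is modeled by rounding a software inverse to an 8-bit grid, the VMM circuitry by ordinary floating-point products, and the names `quantized_inverse`, `refine_solve`, and `demo` are hypothetical.

```python
import numpy as np


def quantized_inverse(A, bits=8):
    """Software stand-in (assumption) for a low-precision inversion unit:
    invert A, then round each entry to a signed `bits`-bit uniform grid."""
    A_inv = np.linalg.inv(A)
    levels = 2 ** (bits - 1) - 1             # 127 positive levels for 8 bits
    scale = np.max(np.abs(A_inv)) / levels   # grid step
    return np.round(A_inv / scale) * scale


def refine_solve(A, b, iters=20, bits=8):
    """Solve A x = b by iterative refinement around a coarse inverse M.

    Each step forms the residual r = b - A x (a matrix-vector product) and
    applies the correction x <- x + M r (the coarse inverse). The error
    shrinks each step as long as ||I - M A|| < 1.
    """
    M = quantized_inverse(A, bits)
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x        # residual, computed in full precision
        x = x + M @ r        # correction from the low-precision inverse
    return x


def demo(n=8, seed=0):
    """Compare one-shot use of the coarse inverse with the refined solve."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((n, n))
    A = B @ B.T / n + np.eye(n)   # symmetric positive definite test matrix
    b = rng.standard_normal(n)
    x_ref = np.linalg.solve(A, b)

    def rel_err(x):
        return np.linalg.norm(x - x_ref) / np.linalg.norm(x_ref)

    one_shot = rel_err(quantized_inverse(A) @ b)
    refined = rel_err(refine_solve(A, b))
    return one_shot, refined


if __name__ == "__main__":
    one_shot, refined = demo()
    print(f"one-shot relative error: {one_shot:.3e}")
    print(f"refined relative error:  {refined:.3e}")
```

In this sketch the accuracy of the final answer is set by the precision of the residual computation rather than by the precision of the inverse, which is why a coarse inverse can suffice when the matrix is reasonably well conditioned. For ill-conditioned matrices the condition $\lVert I - MA \rVert < 1$ may fail and the iteration may not converge.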