In early deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the training execution time. Accordingly, extensive research has been devoted to reducing the computational burden of these layers. In recent mobile-friendly DNNs, however, the relative number of operations spent in convolution and linear layers has decreased significantly. As a result, the share of execution time taken by other layers, such as batch normalization layers, has increased. In this work, we therefore conduct a detailed analysis of the batch normalization layer in order to efficiently reduce its runtime overhead. Backed by this analysis, we present an extremely efficient batch normalization scheme, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques: i) low bit-precision, ii) range batch normalization, and iii) block floating point. All of these approximation techniques are carefully applied not only to preserve the statistics of the intermediate feature maps, but also to minimize off-chip memory accesses. Using the proposed LightNorm hardware, we achieve significant area and energy savings during DNN training without hurting training accuracy, which makes the proposed hardware a strong candidate for on-device training.
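The abstract names the three approximation techniques without detailing them. As a rough illustration only, the NumPy sketch below combines a range-style batch normalization (replacing the per-channel standard deviation with a scaled min-max range, following the range BN formulation from the literature) with a toy block floating point quantizer that shares one exponent per block of values. The block size, mantissa width, and scaling constant are illustrative assumptions, not values taken from this work.

```python
# Minimal sketch (not the authors' implementation) of range batch
# normalization plus a simple block floating point quantizer.
import numpy as np

def range_batch_norm(x, eps=1e-5):
    """Normalize NCHW activations using a scaled min-max range in place of the std.

    The per-channel range (max - min) of n centered samples is rescaled by
    1/sqrt(2*ln(n)), the scaling used in range BN formulations in the
    literature, so the output scale stays close to that of standard BN.
    """
    n = x.shape[0] * x.shape[2] * x.shape[3]          # samples per channel
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    centered = x - mean
    rng = centered.max(axis=(0, 2, 3), keepdims=True) - \
          centered.min(axis=(0, 2, 3), keepdims=True)
    scale = 1.0 / np.sqrt(2.0 * np.log(n))            # range-to-scale constant
    return centered / (scale * rng + eps)

def block_float_quantize(x, block_size=16, mantissa_bits=8):
    """Quantize a tensor in blocks that share a single exponent (toy version)."""
    flat = x.ravel().copy()
    pad = (-len(flat)) % block_size
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block_size)
    # One exponent per block, taken from the largest magnitude in the block.
    exp = np.floor(np.log2(np.abs(blocks).max(axis=1, keepdims=True) + 1e-30))
    step = 2.0 ** (exp - (mantissa_bits - 1))
    q = np.round(blocks / step) * step                # shared-exponent rounding
    return q.ravel()[:x.size].reshape(x.shape)

x = np.random.randn(8, 3, 4, 4).astype(np.float32)    # toy NCHW batch
y = range_batch_norm(block_float_quantize(x))
print(y.mean(), y.std())                               # zero-mean output, range-normalized scale
```

The sketch only mimics the arithmetic structure; the paper's contribution is fusing such approximations with low bit-precision in a dedicated hardware module so that feature-map statistics are preserved while off-chip traffic is reduced.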