Various hardware accelerators have been developed for energy-efficient, real-time inference of neural networks on edge devices. However, most training is done on high-performance GPUs or servers, and the huge memory and computing costs prevent training neural networks on edge devices. This paper proposes a novel tensor-based training framework that offers orders-of-magnitude memory reduction in the training process. We propose a novel rank-adaptive tensorized neural network model and design a hardware-friendly low-precision algorithm to train it. We present an FPGA accelerator to demonstrate the benefits of this training method on edge devices. Our preliminary FPGA implementation achieves $59\times$ speedup and $123\times$ energy reduction compared to an embedded CPU, and $292\times$ memory reduction over standard full-size training.
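To give a rough sense of where the memory reduction comes from, the sketch below counts parameters for a dense fully-connected layer versus a tensor-train (TT) factorization of the same layer. This is a generic illustration of TT compression, not the paper's actual rank-adaptive model; the mode shapes and the fixed TT-rank are assumptions chosen for the example.

```python
# Illustrative only: parameter count of a dense weight matrix vs. a
# tensor-train (TT) factorization of the same layer. The paper's model
# adapts ranks during training; here we fix a single internal rank.
from math import prod

def tt_param_count(dims_in, dims_out, rank):
    # A d-core TT layer stores cores of shape (r_{k-1}, m_k, n_k, r_k),
    # with boundary ranks r_0 = r_d = 1 and internal ranks fixed here.
    d = len(dims_in)
    ranks = [1] + [rank] * (d - 1) + [1]
    return sum(ranks[k] * dims_in[k] * dims_out[k] * ranks[k + 1]
               for k in range(d))

# Factor a 1024x1024 layer with mode shapes (4,8,8,4) and TT-rank 8.
dims_in, dims_out, rank = (4, 8, 8, 4), (4, 8, 8, 4), 8
dense = prod(dims_in) * prod(dims_out)   # 1048576 parameters
tt = tt_param_count(dims_in, dims_out, rank)
print(f"dense: {dense}  TT: {tt}  reduction: {dense / tt:.0f}x")
```

With these assumed shapes the TT layer stores 8,448 parameters instead of 1,048,576, roughly a 124x reduction for this one layer; the training-time savings reported in the abstract additionally come from low-precision arithmetic and compressed gradients.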