While the accuracy of convolutional neural networks has improved substantially through larger and deeper network architectures, the memory footprint for storing their parameters and activations has grown as well. This trend especially challenges power- and resource-limited accelerator designs, which are often restricted to storing all network data in on-chip memory to avoid interfacing with energy-hungry external memories. Maximizing the network size that fits on a given accelerator therefore requires maximizing its memory utilization. Whereas the traditionally used ping-pong buffering technique maps subsequent activation layers to disjoint memory regions, we propose a mapping method that allows these regions to overlap and thus utilizes the memory more efficiently. This work presents a mathematical model to compute the maximum activation memory overlap and thus the lower bound of on-chip memory needed for layer-by-layer processing of convolutional neural networks on memory-limited accelerators. Our experiments with various real-world object detector networks show that the proposed mapping technique can decrease the activation memory by up to 32.9%, reducing the overall memory for the entire network by up to 23.9% compared to traditional ping-pong buffering. For higher-resolution de-noising networks, we achieve activation memory savings of 48.8%. Additionally, we implement a face detector network on an FPGA-based camera to validate these memory savings in a complete end-to-end system.
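To make the contrast concrete, the following toy sketch compares the two buffer-sizing strategies on a list of per-layer activation sizes. It is a simplified illustration only, not the paper's model: we assume ping-pong buffering reserves two disjoint buffers, each sized for the largest activation map in the network, while the overlap-aware bound is approximated here by the largest input-plus-output pair of adjacent layers (the paper's full model computes element-level overlap and can go lower still). The function names and the example sizes are hypothetical.

```python
def ping_pong_bytes(act_sizes):
    """Memory for classic ping-pong buffering: two disjoint buffers,
    each large enough to hold the biggest activation map."""
    return 2 * max(act_sizes)

def overlap_lower_bound_bytes(act_sizes):
    """Simplified overlap-aware requirement: only each layer's input
    and output must be live at the same time, so size the memory for
    the worst adjacent (input, output) pair instead of 2x the maximum."""
    return max(a + b for a, b in zip(act_sizes, act_sizes[1:]))

# Hypothetical per-layer activation sizes (e.g. in KiB) for a small CNN.
acts = [150, 75, 40, 20]

print(ping_pong_bytes(acts))          # 2 * 150 = 300
print(overlap_lower_bound_bytes(acts))  # 150 + 75 = 225
```

Even this coarse model shows why disjoint double buffering over-provisions: the two largest activations rarely belong to the same layer transition, so sizing for the worst adjacent pair (and, in the paper's model, letting output rows overwrite already-consumed input rows) recovers a significant fraction of the on-chip memory.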