Deploying neural networks on constrained hardware platforms such as 32-bit microcontrollers is a challenging task because of the large memory, computing and energy requirements of their inference process. To tackle these issues, several convolution primitives have been proposed to make the standard convolution more computationally efficient. However, few of these primitives have actually been implemented for 32-bit microcontrollers. In this work, we collect different state-of-the-art convolutional primitives and propose an implementation for the ARM Cortex-M processor family with an open source deployment platform (NNoM). We then carry out experimental characterization tests on these implementations. Our benchmark reveals a linear relationship between theoretical MACs and energy consumption, thus showing the advantages of using computationally efficient primitives like shift convolution. We discuss the significant reduction in latency and energy consumption enabled by SIMD instructions and highlight the importance of data reuse in those performance gains. For reproducibility purposes and further experiments, the code and experiments are publicly available.