Reducing the memory footprint of neural networks is a crucial prerequisite for deploying them on small, low-cost embedded devices. The number of network parameters can often be reduced significantly through pruning. We discuss how best to represent the indexing overhead of sparse networks for the coming generation of Single Instruction, Multiple Data (SIMD)-capable microcontrollers. From this, we develop Delta-Compressed Storage Row (dCSR), a storage format for sparse matrices that allows for both low-overhead storage and fast inference on embedded systems with wide SIMD units. We demonstrate our method on an ARM Cortex-M55 MCU prototype with the M-Profile Vector Extension (MVE). A comparison of memory consumption and throughput shows that our method achieves competitive compression ratios and increases throughput over dense methods by up to $2.9 \times$ for sparse matrix-vector multiplication (SpMV)-based kernels and $1.06 \times$ for sparse matrix-matrix multiplication (SpMM). This is accomplished by generating the index information directly in the SIMD unit, which increases the effective memory bandwidth.
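To make the idea of delta-compressed indexing concrete, the following is a minimal sketch, not the dCSR layout itself, whose bit-level details the abstract does not give. It assumes a plain CSR structure in which each column index is replaced by its distance from the previous nonzero in the same row (the first delta in a row is taken from column 0). The function names `to_delta_csr` and `spmv_delta_csr` are hypothetical and chosen only for illustration.

```python
def to_delta_csr(dense):
    """Encode a dense matrix (list of rows) as CSR with delta-encoded columns.

    Assumption for illustration: each delta is the distance to the previous
    nonzero in the same row, starting from column 0. This is not the exact
    dCSR bit layout.
    """
    values, deltas, row_ptr = [], [], [0]
    for row in dense:
        prev = 0
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                deltas.append(col - prev)  # small number instead of a full index
                prev = col
        row_ptr.append(len(values))  # end of this row in values/deltas
    return values, deltas, row_ptr


def spmv_delta_csr(values, deltas, row_ptr, x):
    """Compute y = A @ x, rebuilding column indices from the deltas on the fly."""
    y = []
    for r in range(len(row_ptr) - 1):
        col = 0
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            col += deltas[k]  # running sum restores the absolute column index
            acc += values[k] * x[col]
        y.append(acc)
    return y


dense = [
    [0, 2, 0, 0, 3],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 4, 0],
]
values, deltas, row_ptr = to_delta_csr(dense)
y = spmv_delta_csr(values, deltas, row_ptr, [1, 2, 3, 4, 5])
```

Because the deltas are small, they can be stored in fewer bits than absolute column indices, which is where the storage saving in a scheme of this kind would come from. The running sum in the inner loop is scalar here; the abstract describes the index information as being generated directly in the SIMD unit, which this sketch does not attempt to reproduce.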