Embedded and IoT devices, largely powered by microcontroller units (MCUs), could be made more intelligent by leveraging on-device deep learning. One of the main challenges of neural network inference on an MCU is the extremely limited amount of read-write on-chip memory (SRAM, < 512 kB). SRAM is consumed by the neural network layer (operator) input and output buffers, which, traditionally, must be in memory (materialised) for an operator to execute. We discuss a novel execution paradigm for microcontroller deep learning, which modifies the execution of neural networks to avoid materialising full buffers in memory, drastically reducing SRAM usage with no computation overhead. This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time. We describe a partial execution compiler, Pex, which produces memory-efficient execution schedules automatically by identifying subgraphs of operators whose execution can be split along the feature ("channel") dimension. Memory usage is reduced further by targeting memory bottlenecks with structured pruning, leading to the co-design of the network architecture and its execution schedule. Our evaluation of image and audio classification models: (a) establishes state-of-the-art performance in low SRAM usage regimes for the considered tasks, with up to a +2.9% accuracy increase; (b) finds that a 4x memory reduction is possible by applying partial execution alone, or up to 10.5x when using the compiler-pruning co-design, while maintaining classification accuracy compared to prior work; (c) uses the recovered SRAM to process higher-resolution inputs instead, increasing accuracy by up to +3.9% on Visual Wake Words.
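As a rough illustration of the channel-split schedules described above, the following is a minimal NumPy sketch (not Pex's actual implementation; all names, shapes, and the chunk size are hypothetical) showing how an intermediate buffer between two channel-separable operators can be produced and consumed a few channels at a time, so the full intermediate tensor is never resident in memory.

```python
import numpy as np

# Illustrative sketch only: partial execution of a 1x1 (pointwise) convolution
# followed by a per-channel scaling, computed CHUNK output channels at a time.
# Names and shapes are hypothetical, not Pex's actual API.

H, W, C_in, C_out = 32, 32, 64, 128
CHUNK = 16  # number of output channels materialised at once

x = np.random.rand(H, W, C_in).astype(np.float32)
w1 = np.random.rand(C_in, C_out).astype(np.float32)   # 1x1 conv weights
scale = np.random.rand(C_out).astype(np.float32)       # per-channel operator

def full_execution():
    # Conventional schedule: the whole intermediate (H * W * C_out values)
    # would have to be resident in memory between the two operators.
    intermediate = x @ w1                # shape (H, W, C_out)
    return intermediate * scale

def partial_execution():
    # Channel-split schedule: only H * W * CHUNK intermediate values
    # are live at any point in time.
    out = np.empty((H, W, C_out), dtype=np.float32)
    for c in range(0, C_out, CHUNK):
        partial = x @ w1[:, c:c + CHUNK]          # a slice of the intermediate
        out[:, :, c:c + CHUNK] = partial * scale[c:c + CHUNK]
    return out

# Both schedules compute the same result; only peak memory differs.
assert np.allclose(full_execution(), partial_execution(), atol=1e-5)
```

In this sketch the peak size of the intermediate buffer drops from H * W * C_out to H * W * CHUNK values, with the same number of arithmetic operations, which mirrors the no-computation-overhead claim above.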