Memory optimization for deep neural network (DNN) inference has gained high relevance with the emergence of TinyML, which refers to the deployment of DNN inference tasks on tiny, low-power microcontrollers. Applications such as audio keyword detection or radar-based gesture recognition are heavily constrained by the limited memory of such devices, because DNN inference requires large intermediate run-time buffers to store activations and other intermediate data, which leads to high memory usage. In this paper, we propose a new Fused Depthwise Tiling (FDT) method for the memory optimization of DNNs which, in contrast to existing tiling methods, reduces memory usage without inducing any run-time overhead. FDT applies to a larger variety of network layers than existing tiling methods, which focus on convolutions. It improves TinyML memory optimization significantly by reducing the memory of models where this was not possible before, and by providing alternative design points for models that show high run-time overhead with existing methods. To identify the best tiling configuration, we propose an end-to-end flow with a new path discovery method that applies FDT and existing tiling methods in a fully automated way, including the scheduling of operations and the planning of buffer layouts in memory. Out of seven evaluated models, FDT achieved significant memory reductions of 76.2% and 18.1% for two models to which existing tiling methods could not be applied. Two further models showed significant run-time overhead with existing methods; for these, FDT provided alternative design points with no overhead at the cost of smaller memory savings.
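To illustrate the core idea behind depth-wise tiling, the following is a minimal NumPy sketch, not the paper's implementation: two pointwise (1x1) layers are fused, and their intermediate tensor is tiled along the depth (channel) axis so that only one tile of it is ever materialized. All names here (`fused_depthwise_tiled`, `pointwise`, the tile size) are illustrative assumptions.

```python
import numpy as np

def pointwise(x, w):
    # 1x1 convolution over an H x W x C_in tensor: matmul with C_in x C_out weights.
    return x @ w

def relu(x):
    return np.maximum(x, 0.0)

def fused_depthwise_tiled(x, w1, w2, tile_c=4):
    """Compute relu(x @ w1) @ w2 without materializing the full intermediate
    tensor: each depth tile of the intermediate result is produced and
    immediately consumed by the second layer (partial-sum accumulation)."""
    h, w_, _ = x.shape
    c_mid = w1.shape[1]
    c_out = w2.shape[1]
    out = np.zeros((h, w_, c_out))
    for c0 in range(0, c_mid, tile_c):
        c1 = min(c0 + tile_c, c_mid)
        # Live intermediate buffer holds only tile_c channels instead of c_mid.
        tile = relu(pointwise(x, w1[:, c0:c1]))   # h x w x tile_c
        out += pointwise(tile, w2[c0:c1, :])      # accumulate partial sums
    return out

def unfused(x, w1, w2):
    # Untiled reference: peak intermediate buffer is h * w * c_mid.
    return pointwise(relu(pointwise(x, w1)), w2)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 64))
w2 = rng.standard_normal((64, 32))
assert np.allclose(fused_depthwise_tiled(x, w1, w2), unfused(x, w1, w2))
```

In this sketch the tile boundary runs along the depth axis, so no input elements are recomputed across tiles, which is consistent with the abstract's claim that depth-wise tiling avoids the run-time overhead that spatially tiled fusion incurs from overlapping receptive fields.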