Memory-Aware Fusing and Tiling of Neural Networks for Accelerated Edge Inference

Running costly machine learning (ML) networks locally on resource-constrained edge devices is a growing research challenge. ML networks with large convolutional layers can easily exceed available memory, increasing latency due to excessive swapping. Existing memory reduction techniques such as pruning and quantization reduce model accuracy and often require retraining. Alternatively, distributed methods partition the convolutions into equivalent smaller sub-computations, but these implementations introduce communication costs and require a network of devices. The same partitioning approach, however, can also be used on a single device to run in a reduced memory footprint by subdividing the network into smaller operations. This report extends prior work on distributed partitioning, which tiles and fuses convolutional layers, into memory-aware execution on a single device. Our approach extends prior fusing strategies to allow for two groups of convolutional layers that are fused and tiled independently. This reduces overhead via data reuse and further reduces the memory footprint. We also propose a memory usage predictor coupled with a search algorithm to provide fusing and tiling configurations for an arbitrary set of convolutional layers. When applied to the YOLOv2 object detection network, our approach runs in less than half the memory, with a speedup of up to 2.78x under severe memory constraints. Additionally, our algorithm returns a configuration whose latency is within 6% of the best latency measured in a manual search.
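
The idea of fusing and tiling can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes unpadded 1-D convolutions on plain Python lists, and every name in it (`ConvLayer`, `required_input`, `run_group_tiled`) is hypothetical. The sketch maps each output tile of a fused group back to the input slice it depends on, computes the tiles one at a time, and shows that the result equals the untiled computation while only a small input slice is live at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    kernel: int      # kernel width (hypothetical 1-D stand-in for a conv layer)
    stride: int = 1  # step between output positions


def required_input(out_start, out_end, layers):
    """Map a half-open output range of the last layer back to the input
    range of the first layer, for a fused group of unpadded convolutions."""
    start, end = out_start, out_end
    for layer in reversed(layers):
        start = start * layer.stride
        end = (end - 1) * layer.stride + layer.kernel
    return start, end


def conv1d(signal, weights, stride=1):
    """Plain unpadded 1-D convolution (cross-correlation) on Python lists."""
    k = len(weights)
    return [
        sum(signal[i + j] * weights[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def run_group(signal, kernels, layers):
    """Run every layer of the group over the whole signal (no tiling)."""
    for weights, layer in zip(kernels, layers):
        signal = conv1d(signal, weights, layer.stride)
    return signal


def run_group_tiled(signal, kernels, layers, out_len, tile):
    """Compute the fused group's output tile by tile. Returns the output
    and the largest input slice that had to be held at any one time."""
    output, peak = [], 0
    for out_start in range(0, out_len, tile):
        out_end = min(out_start + tile, out_len)
        in_start, in_end = required_input(out_start, out_end, layers)
        chunk = signal[in_start:in_end]  # includes the overlap between tiles
        peak = max(peak, len(chunk))
        output.extend(run_group(chunk, kernels, layers))
    return output, peak


# Usage: two fused layers, output computed four elements at a time.
layers = [ConvLayer(kernel=3, stride=1), ConvLayer(kernel=3, stride=2)]
kernels = [[1, 0, -1], [1, 2, 1]]
signal = list(range(32))

full = run_group(signal, kernels, layers)
tiled, peak = run_group_tiled(signal, kernels, layers, len(full), tile=4)
print(tiled == full, peak, len(signal))
```

Neighbouring tiles read overlapping input regions, so smaller tiles lower the peak footprint at the cost of recomputing or re-reading the overlap. A tiling configuration search has to balance these two effects.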
