Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing

For time-critical IoT applications that use deep learning, accelerating inference through distributed computing is a promising way to meet stringent deadlines. In this paper, we implement a working prototype of HALP, a new distributed inference acceleration method, using three Raspberry Pi 4 devices. HALP accelerates inference through seamless collaboration among edge devices (EDs) in edge computing. We maximize the overlap between communication and computation among the collaborating EDs by optimizing the task partitioning ratio under segment-based partitioning. Experimental results show that distributed inference with HALP achieves a 1.7x inference speedup for VGG-16. We then combine distributed inference with conventional neural network model compression by configuring MobileNet-V1 with different shrinking hyperparameters. This accelerates inference further, but at the cost of inference accuracy. To balance latency and accuracy, we propose dynamic model selection, which selects the model that provides the highest accuracy within the latency constraint. We show that model selection combined with distributed inference using HALP significantly improves service reliability compared to conventional stand-alone computation.
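The dynamic model selection rule described above (pick the most accurate model whose latency fits the deadline) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ModelProfile`, `select_model`, and `CANDIDATES` are hypothetical, and the latency and accuracy figures are made-up placeholders rather than measurements from the paper. The sketch also assumes that the latency of each model variant has already been profiled offline.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical profile of one compressed model variant."""
    name: str
    latency_ms: float  # assumed to be profiled offline
    accuracy: float    # assumed top-1 accuracy in [0, 1]


def select_model(profiles: Sequence[ModelProfile],
                 deadline_ms: float) -> Optional[ModelProfile]:
    """Return the most accurate model that meets the deadline, or None."""
    # Keep only the variants whose latency fits within the constraint.
    feasible = [p for p in profiles if p.latency_ms <= deadline_ms]
    if not feasible:
        return None  # no variant can meet the deadline
    # Highest accuracy wins; ties are broken in favor of lower latency.
    return max(feasible, key=lambda p: (p.accuracy, -p.latency_ms))


# Placeholder values for illustration only, not results from the paper.
CANDIDATES = [
    ModelProfile("variant_large", latency_ms=90.0, accuracy=0.70),
    ModelProfile("variant_medium", latency_ms=60.0, accuracy=0.65),
    ModelProfile("variant_small", latency_ms=35.0, accuracy=0.55),
]

chosen = select_model(CANDIDATES, deadline_ms=70.0)
```

Under these placeholder numbers, a 70 ms deadline excludes the largest variant, so the medium variant is chosen. Lowering the per-variant latency, which is what distributed inference aims to do, enlarges the feasible set and so can only keep or raise the selected accuracy for a given deadline.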
