Neural network inference is usually constrained by the resources (e.g., computing power, memory, bandwidth) available on edge devices. In addition to improving hardware design and deploying efficient models, it is possible to aggregate the computing power of many devices to make such models feasible. In this paper, we propose a novel method that exploits model parallelism to partition a neural network for distributed inference. To strike a better balance among communication latency, computation latency, and performance, we adopt neural architecture search (NAS) to find the best transmission policy and reduce the amount of communication. The best model we found reduces the amount of data transmitted by 86.6% compared to the baseline, with little impact on performance. With proper device specifications and model configurations, our experiments show that inference of large neural networks on edge clusters can be distributed and accelerated, providing a new solution for deploying intelligent applications in the Internet of Things (IoT).
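To make the partition-and-transmit idea concrete, the sketch below splits a small network into two partitions and casts the intermediate activation to a lower precision before handing it to the next device. The layer choices, shapes, and the float16 cast (standing in for a learned transmission policy) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions, not the paper's method): model-parallel
# inference across two partitions, with the intermediate activation
# downcast to float16 to mimic reducing the transmitted payload.
import numpy as np
import torch
import torch.nn as nn

# Partition 1, intended to run on the first edge device.
part1 = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)
# Partition 2, intended to run on the second edge device.
part2 = nn.Sequential(
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)

def transmit(tensor: torch.Tensor) -> bytes:
    # Stand-in for the network link: lower precision halves the payload size.
    return tensor.half().numpy().tobytes()

def receive(payload: bytes, shape) -> torch.Tensor:
    # Rebuild the activation on the receiving device.
    arr = np.frombuffer(payload, dtype=np.float16).copy().reshape(shape)
    return torch.from_numpy(arr).float()

x = torch.randn(1, 3, 32, 32)
with torch.no_grad():
    act = part1(x)                      # computed on device A
    payload = transmit(act)             # reduced-precision intermediate features
    act2 = receive(payload, act.shape)  # reconstructed on device B
    logits = part2(act2)
print(logits.shape)  # torch.Size([1, 10])
```

In the actual system, the transmission policy (which layer to cut and how to compress the activation) would be chosen by the NAS procedure rather than fixed by hand as in this sketch.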