FTPipeHD: A Fault-Tolerant Pipeline-Parallel Distributed Training Framework for Heterogeneous Edge Devices

With the growing proliferation of Internet of Things (IoT) devices, there is a trend towards distributing deep learning (DL) workloads across edge devices rather than centralizing them in the cloud. This shift enables better privacy preservation, real-time responses, and user-specific models. Deploying deep and complex models on resource-limited edge devices requires partitioning deep neural network (DNN) models, a problem that has been widely studied. However, most of the existing literature only considers distributing the inference model while still relying on centralized cloud infrastructure to generate this model through training. In this paper, we propose FTPipeHD, a novel DNN training framework that trains DNN models across distributed heterogeneous devices with a fault-tolerance mechanism. To accelerate training when the computing power of each device varies over time, we optimize the partition points dynamically according to real-time computing capacities. We also propose a novel weight redistribution approach that periodically replicates the weights to both the neighboring nodes and the central node, which combats the failure of multiple devices during training while incurring limited communication cost. Our numerical results demonstrate that FTPipeHD is 6.8x faster in training than the state-of-the-art method when the computing capacity of the best device is 10x greater than that of the worst one. The results also show that the proposed method accelerates training even in the presence of device failures.
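To make the idea of capacity-aware partition points concrete, the following is a minimal sketch and not the optimizer used in FTPipeHD, which the abstract does not specify. It assumes a chain of layers with known per-layer compute costs, a fixed device order, and a scalar capacity per device, and it ignores communication cost entirely. The function name `partition_layers` and its parameters are hypothetical. It uses dynamic programming to choose contiguous stages that minimize the time of the slowest pipeline stage.

```python
from itertools import accumulate


def partition_layers(layer_costs, capacities):
    """Split a chain of layers into contiguous stages, one per device.

    Hypothetical helper (not from the paper): minimizes the slowest
    stage time, where stage time = sum of its layer costs / capacity.
    Returns (bottleneck_time, partition_points); partition_points[i] is
    the index of the first layer assigned to device i + 1.
    """
    n, k = len(layer_costs), len(capacities)
    if k == 0 or k > n:
        raise ValueError("need between 1 and len(layer_costs) devices")

    # prefix[j] = total cost of layers 0..j-1
    prefix = [0.0] + list(accumulate(layer_costs))
    inf = float("inf")

    # best[d][j]: minimal bottleneck when the first d devices
    # process the first j layers; cut[d][j]: where device d starts.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0

    for d in range(1, k + 1):
        for j in range(d, n + 1):
            for i in range(d - 1, j):  # device d gets layers i..j-1
                stage = (prefix[j] - prefix[i]) / capacities[d - 1]
                cand = max(best[d - 1][i], stage)
                if cand < best[d][j]:
                    best[d][j] = cand
                    cut[d][j] = i

    # Walk the cut table backwards to recover the stage boundaries.
    points, j = [], n
    for d in range(k, 0, -1):
        j = cut[d][j]
        points.append(j)
    points.reverse()  # points[0] is always 0 (start of the first stage)
    return best[k][n], points[1:]


if __name__ == "__main__":
    # Two devices, the first 10x faster than the second.
    bottleneck, points = partition_layers([1.0] * 11, [10.0, 1.0])
    print(bottleneck, points)
```

Under these assumptions, dynamic re-partitioning amounts to calling the function again whenever fresh capacity measurements arrive and moving layers only if the new bottleneck is meaningfully lower. A real system would also have to account for activation transfer time between stages and for the cost of migrating weights.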
