While distributed training significantly speeds up the training of deep neural networks (DNNs), cluster utilization remains relatively low because of the time-consuming data synchronization between workers. To alleviate this problem, a novel Hierarchical Parallel SGD (HPSGD) strategy is proposed, based on the observation that the data synchronization phase can be performed in parallel with the local training phase (i.e., feed-forward and back-propagation). Furthermore, an improved model updating method is utilized to remedy the resulting stale-gradient problem: updates are committed to a replica (i.e., a temporary model with the same parameters as the global model), and the average changes are then merged into the global model. Extensive experiments demonstrate that the proposed HPSGD approach substantially accelerates distributed DNN training, reduces the disturbance caused by stale gradients, and achieves better accuracy within a given fixed wall time.
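To make the described scheme concrete, the following is a minimal single-process sketch of the update flow suggested by the abstract: synchronization runs in a background thread while local SGD steps are committed to a replica of the global model, and the replica's average change is merged back once synchronization completes. All names (`sync_gradients`, `local_step`, `hpsgd_round`) and the exact merge rule are illustrative assumptions, not the authors' implementation.

```python
import threading
import numpy as np

def sync_gradients(grads, result):
    """Stand-in for the cross-worker all-reduce (here: a simple average)."""
    # In a real cluster this would be a network all-reduce over workers.
    result["avg_grad"] = np.mean(grads, axis=0)

def local_step(params, lr=0.01):
    """One simulated feed-forward/back-propagation step (placeholder gradient)."""
    grad = np.random.randn(*params.shape) * 0.1
    return params - lr * grad

def hpsgd_round(global_params, local_steps=4, lr=0.01):
    # Replica: a temporary model with the same parameters as the global model.
    replica = global_params.copy()
    snapshot = replica.copy()

    # Launch (simulated) gradient synchronization in parallel with local training.
    pending = [np.random.randn(*global_params.shape) * 0.1 for _ in range(2)]
    result = {}
    sync_thread = threading.Thread(target=sync_gradients, args=(pending, result))
    sync_thread.start()

    # Local training continues on the replica while synchronization is in flight.
    for _ in range(local_steps):
        replica = local_step(replica, lr)

    sync_thread.join()

    # Merge the synchronized gradient and the replica's average change into the
    # global model, rather than overwriting it with the (stale) replica.
    avg_change = (replica - snapshot) / local_steps
    return global_params - lr * result["avg_grad"] + avg_change

if __name__ == "__main__":
    params = np.zeros(8)
    for _ in range(3):
        params = hpsgd_round(params)
    print("params after 3 HPSGD rounds:", params)
```

The point of the sketch is only to show the overlap of synchronization with local computation and the replica-then-merge bookkeeping; the actual HPSGD merge formula and scheduling are defined in the paper itself.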