Communication overhead is one of the major obstacles to training large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving a scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than $6k$ communication volume, which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proven. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that O$k$-Top$k$ achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, O$k$-Top$k$ is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
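To illustrate the idea of threshold-based sparsification mentioned in the abstract, the following is a minimal PyTorch-style sketch: instead of an exact (and expensive) global top-$k$, entries whose magnitude exceeds an estimated threshold are kept. The function names, the sampling-based threshold seed, and the 1% density are illustrative assumptions; the actual threshold-estimation procedure used by O$k$-Top$k$ is described in the paper, not reproduced here.

```python
import torch

def threshold_sparsify(grad: torch.Tensor, threshold: float):
    """Keep only gradient entries whose magnitude exceeds the threshold.

    Returns the indices and values of the retained entries.
    (Illustrative only: O$k$-Top$k$'s actual threshold estimation and
    sparse allreduce are specified in the paper.)
    """
    mask = grad.abs() > threshold
    idx = mask.nonzero(as_tuple=False).flatten()
    return idx, grad[idx]

# Hypothetical usage: keep roughly 1% of a 10M-element gradient.
g = torch.randn(10_000_000)
density = 0.01

# One cheap way to seed a threshold is a quantile over a small random
# sample, avoiding an exact global top-k (which would defeat the purpose).
sample = g[torch.randint(0, g.numel(), (100_000,))]
est_threshold = sample.abs().quantile(1 - density).item()

idx, vals = threshold_sparsify(g, est_threshold)
```

In a full pipeline, only `idx` and `vals` would be exchanged by the sparse allreduce, which is where the reduction in communication volume comes from.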