Massive amounts of data have made training large-scale machine learning models on a single worker inefficient. Distributed machine learning methods such as Parallel-SGD have received significant interest as a solution to this problem. However, the performance of distributed systems does not scale linearly with the number of workers because of the high network communication cost of synchronizing gradients and parameters. Researchers have proposed techniques such as quantization and sparsification to alleviate this problem by compressing the gradients. Most compression schemes, however, produce compressed gradients that cannot be directly aggregated with efficient protocols such as all-reduce. In this paper, we present a set of all-reduce compatible gradient compression schemes that significantly reduce the communication overhead while maintaining the performance of vanilla SGD. We report the results of our experiments on the CIFAR10 dataset and the observations derived during the process. Our compression methods perform better than the built-in compression methods currently offered by deep learning frameworks. Code is available at the repository: \url{https://github.com/vineeths96/Gradient-Compression}.
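To illustrate what "all-reduce compatible" means, the following is a minimal sketch (not the authors' exact method) of one such scheme: compressing each gradient with a low-rank random projection whose seed is shared across workers. Because every worker projects onto the same basis, the compressed coefficients can be summed directly with an all-reduce, unlike index-based sparsification such as Top-K. The function name compress_and_all_reduce and the rank parameter k are illustrative assumptions.

import torch
import torch.distributed as dist

def compress_and_all_reduce(grad: torch.Tensor, k: int = 8, seed: int = 0) -> torch.Tensor:
    """Sketch: project a gradient onto k shared random directions,
    average the coefficients across workers, and reconstruct."""
    flat = grad.reshape(-1)
    # Identical seed on every worker -> identical projection matrix,
    # so compressed vectors live in the same basis and can be summed.
    gen = torch.Generator(device=flat.device).manual_seed(seed)
    proj = torch.randn(flat.numel(), k, generator=gen,
                       device=flat.device) / (flat.numel() ** 0.5)
    coeffs = flat @ proj                        # compressed representation of size k
    dist.all_reduce(coeffs, op=dist.ReduceOp.SUM)
    coeffs /= dist.get_world_size()             # average across workers
    return (proj @ coeffs).reshape(grad.shape)  # approximate averaged gradient

In such a scheme, only k coefficients per tensor travel over the network instead of the full gradient, and the aggregation step remains a plain all-reduce with no gather or decompression on a central server.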