Gradient quantization is an emerging technique for reducing communication costs in distributed learning. Existing gradient quantization algorithms often rely on engineering heuristics or empirical observations and lack a systematic approach to dynamically quantizing gradients. This paper addresses this issue by proposing a novel dynamically quantized SGD (DQ-SGD) framework, which adjusts the quantization scheme at each gradient descent step by exploring the trade-off between communication cost and convergence error. We derive an upper bound on the convergence error, tight in some cases, for a restricted family of quantization schemes and loss functions. We then design the DQ-SGD algorithm by minimizing the communication cost subject to a constraint on the convergence error. Finally, through extensive experiments on large-scale natural language processing and computer vision tasks on the AG-News, CIFAR-10, and CIFAR-100 datasets, we demonstrate that our quantization scheme achieves a better trade-off between communication cost and learning performance than other state-of-the-art gradient quantization methods.
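To make the general idea concrete, the sketch below shows a dynamically quantized SGD step: an unbiased uniform stochastic quantizer compresses each gradient to a chosen number of bits, and the bit budget is adjusted from step to step. This is a minimal illustration only; the `choose_bits` schedule is a hypothetical placeholder and is not the bit-allocation rule derived in the paper.

```python
import numpy as np

def stochastic_quantize(g, bits):
    """Unbiased uniform stochastic quantization of gradient vector g to `bits` bits.

    Values are mapped onto 2**bits - 1 uniform levels spanning the per-vector range
    and rounded up or down at random so the quantized estimate is unbiased.
    """
    levels = 2 ** bits - 1
    g_min, g_max = g.min(), g.max()
    scale = (g_max - g_min) / levels if g_max > g_min else 1.0
    normalized = (g - g_min) / scale
    lower = np.floor(normalized)
    prob_up = normalized - lower                  # probability of rounding up
    rounded = lower + (np.random.rand(*g.shape) < prob_up)
    return g_min + rounded * scale                # dequantized, unbiased estimate of g

def choose_bits(step, total_steps, b_min=2, b_max=8):
    """Hypothetical bit schedule (not the paper's rule): spend more bits later in
    training, when gradients are small and quantization error matters more."""
    frac = step / max(total_steps - 1, 1)
    return int(round(b_min + frac * (b_max - b_min)))

# Toy usage: quantized SGD on a least-squares objective.
rng = np.random.default_rng(0)
A, y = rng.normal(size=(100, 10)), rng.normal(size=100)
w, lr, T = np.zeros(10), 0.01, 50
for t in range(T):
    grad = A.T @ (A @ w - y) / len(y)
    bits = choose_bits(t, T)                      # dynamically adjusted precision
    w -= lr * stochastic_quantize(grad, bits)     # descend using the quantized gradient
```

In a distributed setting, each worker would transmit only the quantized representation (roughly `bits` per coordinate plus the range), and the per-step choice of `bits` is where a trade-off between communication cost and convergence error can be made.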