One of the most significant bottlenecks in training large-scale machine learning models on a parameter server (PS) is the communication overhead, because model gradients must be frequently exchanged between the workers and servers during the training iterations. Gradient quantization has been proposed as an effective approach to reducing the communication volume. A key issue in gradient quantization is setting the number of bits used to quantize the gradients. A small number of bits can significantly reduce the communication overhead but hurts gradient accuracy, and vice versa. An ideal quantization method would dynamically balance the communication overhead and model accuracy by adjusting the number of bits according to knowledge learned from the immediately preceding training iterations. Existing methods, however, quantize the gradients either with a fixed number of bits or with predefined heuristic rules. In this paper we propose a novel adaptive quantization method within the framework of reinforcement learning. The method, referred to as MQGrad, formalizes the selection of quantization bits as actions in a Markov decision process (MDP), where the MDP state records information collected from past optimization iterations (e.g., the sequence of loss function values). During the training iterations of a machine learning algorithm, MQGrad continuously updates the MDP state according to changes in the loss function. Based on this information, the MDP learns to select the optimal actions (numbers of bits) for quantizing the gradients. Experimental results on a benchmark dataset showed that MQGrad can accelerate the learning of a large-scale deep neural network while preserving its prediction accuracy.
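To make the bits-versus-accuracy trade-off concrete, the following is a minimal sketch of uniform k-bit gradient quantization with stochastic rounding. The function name quantize_gradient and the encoding scheme are illustrative assumptions, not MQGrad's actual quantizer, which the abstract does not specify.

```python
import numpy as np

def quantize_gradient(grad, num_bits):
    """Uniformly quantize a gradient vector to num_bits per component.

    Illustrative sketch only: MQGrad's encoding is not specified here,
    so this merely demonstrates the volume/accuracy trade-off that
    motivates adapting num_bits during training.
    """
    levels = 2 ** num_bits - 1            # number of quantization steps
    scale = np.max(np.abs(grad))          # per-vector scaling factor
    if scale == 0:
        return np.zeros_like(grad)
    normalized = grad / scale             # map components into [-1, 1]
    scaled = (normalized + 1) / 2 * levels
    # Stochastic rounding keeps the quantizer unbiased in expectation.
    lower = np.floor(scaled)
    prob_up = scaled - lower
    q = lower + (np.random.rand(*grad.shape) < prob_up)
    return (q / levels * 2 - 1) * scale   # decode back to gradient range

# Fewer bits -> less data to communicate, but larger quantization error.
g = np.random.randn(1000)
for bits in (2, 4, 8):
    err = np.linalg.norm(g - quantize_gradient(g, bits)) / np.linalg.norm(g)
    print(f"{bits}-bit relative error: {err:.4f}")
```

Running the loop shows the relative error shrinking as the bit budget grows, which is exactly the tension an adaptive policy such as MQGrad must balance at each iteration.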