Consider the following distributed optimization scenario. A worker has access to training data that it uses to compute gradients, while a server decides when to stop the iterative computation based on its target accuracy or delay constraints. The server receives all of its information about the problem instance from the worker via a rate-limited noiseless communication channel. We introduce a principle we call Differential Quantization (DQ), which prescribes compensating for past quantization errors so as to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$, where $\sigma_{\mathrm{GD}}$ is the contraction factor of unquantized gradient descent (GD), $\rho_n \geq 1$ is the covering efficiency of the quantizer, and $R$ is the bit rate per problem dimension $n$. Thus, at any rate $R \geq \log_2(\rho_n / \sigma_{\mathrm{GD}})$ bits, the contraction factor of DQ-GD equals that of unquantized GD, i.e., there is no loss due to quantization. We show that no algorithm within a certain class can attain a contraction factor better than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$. Since quantizers exist with $\rho_n \to 1$ as $n \to \infty$ (Rogers, 1963), this means that DQ-GD is asymptotically optimal. The principle of differential quantization continues to apply to gradient methods with momentum, such as Nesterov's accelerated gradient descent and Polyak's heavy-ball method. For these algorithms as well, if the rate is above a certain threshold, the differentially quantized algorithm attains the same contraction factor as its unquantized counterpart. Experimental results on least-squares problems validate our theoretical analysis.
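To make the error-compensation principle concrete, the following is a minimal Python sketch on a least-squares objective. The names `quantize`, `rate_bits`, and `v_max` are illustrative, and the uniform scalar quantizer with a fixed dynamic range is an assumption of this sketch; the DQ-GD analyzed in the paper uses a vector quantizer with covering efficiency $\rho_n$ whose dynamic range contracts along the iterates. The sketch only illustrates how feeding the accumulated quantization error back into the next quantized message keeps the quantized trajectory close to that of unquantized GD.

```python
# Sketch of the differential-quantization principle on a least-squares problem.
# Assumptions: uniform scalar quantizer with fixed range (not the paper's
# vector quantizer with shrinking range); all names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 100                                   # problem dimension, number of samples
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]    # unquantized least-squares solution

L = np.linalg.eigvalsh(A.T @ A).max()            # smoothness constant of f(x) = 0.5*||Ax - b||^2
eta = 1.0 / L                                    # GD step size

def quantize(v, rate_bits=6, v_max=50.0):
    """Uniform scalar quantizer: rate_bits per coordinate, clipped to [-v_max, v_max]."""
    step = 2 * v_max / (2 ** rate_bits)
    return np.clip(np.round(v / step) * step, -v_max, v_max)

x = np.zeros(n)       # server iterate (known to both worker and server)
err = np.zeros(n)     # accumulated quantization error, kept by the worker

for t in range(200):
    grad = A.T @ (A @ x - b)    # worker computes the gradient at the server's iterate
    u = grad + err              # compensate for the past quantization error
    q = quantize(u)             # message sent over the rate-limited channel
    err = u - q                 # error carried over to the next round
    x = x - eta * q             # server's descent step with the quantized message

print("distance to least-squares solution:", np.linalg.norm(x - x_star))
```

Because the server's update telescopes, the quantized trajectory deviates from the unquantized GD trajectory only by $\eta$ times the current accumulated error, which is the intuition behind compensating past quantization errors.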