Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning

In large-scale machine learning, recent works have studied the effects of compressing gradients in stochastic optimization in order to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in large-scale, multi-agent reinforcement learning, almost nothing is known about the analogous question: Are common reinforcement learning (RL) algorithms also robust to similar perturbations? In this paper, we investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our main technical contribution is to show that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. We then extend our results significantly to nonlinear stochastic approximation algorithms and multi-agent settings. In particular, we prove that for multi-agent TD learning, one can achieve linear convergence speedups in the number of agents while communicating just $\tilde{O}(1)$ bits per agent at each time step. Our work is the first to provide finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our analysis hinges on studying the drift of a novel Lyapunov function that captures the dynamics of a memory variable introduced by error feedback.
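
To make the setup concrete, the following is a minimal sketch of TD(0) with linear function approximation, a compressed update direction, and an error-feedback memory variable. It is an illustration under stated assumptions, not the paper's exact algorithm: the top-k sparsifier stands in for the general compression operator, the memory update follows the usual error-feedback recipe from optimization, and the names `top_k` and `ef_td0`, the two-state Markov chain, the features, and the step size are all hypothetical choices made for this example.

```python
import numpy as np


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest.

    Assumption: top-k sparsification is used here only as one example
    of a compression operator.
    """
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[v.size - k:]  # indices of the k largest |v_i|
    out[idx] = v[idx]
    return out


def ef_td0(P, r, Phi, gamma=0.9, alpha=0.05, k=1, steps=2000, seed=0):
    """TD(0) with linear features, compressed updates, and error feedback.

    P:   (n, n) transition matrix of the Markov chain under a fixed policy
    r:   (n,)   reward for each state
    Phi: (n, d) feature matrix, one row per state
    Returns the final parameter vector and the final memory variable.
    """
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    theta = np.zeros(d)  # value-function parameters
    e = np.zeros(d)      # error-feedback memory (accumulated compression error)
    s = 0                # Markovian sampling: follow a single trajectory
    for _ in range(steps):
        s_next = rng.choice(n, p=P[s])
        # Standard TD(0) update direction for linear function approximation.
        delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        g = delta * Phi[s]
        # Compress the direction plus whatever earlier rounds left behind.
        h = top_k(g + e, k)
        # Store the part that compression dropped so it can be applied later.
        e = e + g - h
        theta = theta + alpha * h
        s = s_next
    return theta, e


# Hypothetical two-state example; values chosen only for illustration.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
r = np.array([1.0, 0.0])
Phi = np.array([[1.0, 0.0], [0.5, 0.5]])
theta, e = ef_td0(P, r, Phi, k=1)
print("theta:", theta, "memory:", e)
```

Setting `k` equal to the feature dimension disables compression, in which case the memory variable stays at zero and the loop reduces to plain TD(0). With `k` smaller than the dimension, the memory `e` carries the discarded coordinates forward, which is the quantity the abstract's Lyapunov analysis is said to track.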
