This paper deals with distributed policy optimization in reinforcement learning, which involves a central controller and a group of learners. In particular, two typical settings encountered in several applications are considered: multi-agent reinforcement learning (RL) and parallel RL, where frequent information exchanges between the learners and the controller are required. For many practical distributed systems, however, the overhead caused by these frequent communication exchanges is considerable and becomes the bottleneck of the overall performance. To address this challenge, a novel policy gradient approach is developed for solving distributed RL. The new approach adaptively skips the policy gradient communication during iterations, and can reduce the communication overhead without degrading learning performance. It is established analytically that: i) the novel algorithm has a convergence rate identical to that of the plain-vanilla policy gradient; while ii) if the distributed learners are heterogeneous in terms of their reward functions, the number of communication rounds needed to achieve a desirable learning accuracy is markedly reduced. Numerical experiments corroborate the communication reduction attained by the novel algorithm compared to alternatives.
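To illustrate the communication-skipping idea described above, the following is a minimal sketch, not the paper's exact algorithm: it uses a toy surrogate in place of real policy gradient estimates, and the threshold-based triggering rule, step size, and all variable names are assumptions introduced for illustration only. Each worker uploads a fresh gradient only when it differs enough from the one it last communicated; otherwise the controller reuses the stale gradient.

```python
# Hypothetical sketch of communication-skipping distributed policy gradient.
# The skipping rule and the toy gradient surrogate are illustrative assumptions,
# not the paper's actual method.
import numpy as np

rng = np.random.default_rng(0)

DIM, NUM_WORKERS, ROUNDS = 5, 4, 50
STEP_SIZE, SKIP_THRESHOLD = 0.1, 1e-2

# Heterogeneous targets mimic learners with different reward functions.
targets = rng.normal(size=(NUM_WORKERS, DIM))

def local_policy_gradient(theta, worker):
    """Toy stand-in for a stochastic policy gradient estimate at one worker."""
    noise = 0.01 * rng.normal(size=DIM)
    return (targets[worker] - theta) + noise

theta = np.zeros(DIM)                      # policy parameters at the controller
last_sent = np.zeros((NUM_WORKERS, DIM))   # most recently communicated gradients
uploads = 0

for t in range(ROUNDS):
    aggregate = np.zeros(DIM)
    for m in range(NUM_WORKERS):
        g = local_policy_gradient(theta, m)
        # Skip the upload if the new gradient barely differs from the last
        # one this worker communicated (hypothetical triggering condition).
        if np.linalg.norm(g - last_sent[m]) >= SKIP_THRESHOLD:
            last_sent[m] = g
            uploads += 1
        aggregate += last_sent[m]          # stale gradient reused when skipped
    theta += STEP_SIZE * aggregate / NUM_WORKERS

print(f"uploads used: {uploads} / {ROUNDS * NUM_WORKERS} possible")
```

In this sketch the controller still performs a gradient step every round, but each worker communicates only when its local gradient has changed appreciably, which is the mechanism by which the abstract's claimed communication savings would arise.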