This paper addresses policy learning in non-stationary environments and games with continuous actions. Rather than following the classical reward-maximization mechanism, we draw on the ideas of follow-the-regularized-leader (FTRL) and mirror descent (MD) updates and propose PORL, a no-regret-style reinforcement learning algorithm for continuous-action tasks. We prove that PORL enjoys a last-iterate convergence guarantee, which is important for adversarial and cooperative games. Empirical studies show that, in stationary environments such as MuJoCo locomotion control tasks, PORL performs at least as well as, if not better than, the soft actor-critic (SAC) algorithm; in non-stationary settings, including dynamically changing environments, adversarial training, and competitive games, PORL outperforms SAC with both better final policy performance and a more stable training process.
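For context, the following is a minimal sketch of the generic mirror-descent policy update that FTRL/MD-style methods build on; the symbols ($\hat{Q}_t$, $\eta$, $D_{\psi}$) are standard illustrative notation and not necessarily the exact update used by PORL:
\[
\pi_{t+1} \;=\; \arg\max_{\pi \in \Delta(\mathcal{A})} \Big\{ \eta\, \langle \pi,\, \hat{Q}_t \rangle \;-\; D_{\psi}\big(\pi \,\|\, \pi_t\big) \Big\},
\]
where $\hat{Q}_t$ is the current action-value estimate, $\eta$ is a step size, and $D_{\psi}$ is the Bregman divergence induced by a regularizer $\psi$. With the negative-entropy regularizer this reduces to the exponentiated update $\pi_{t+1}(a) \propto \pi_t(a)\exp\!\big(\eta\, \hat{Q}_t(a)\big)$, the usual multiplicative-weights form of FTRL in the policy simplex.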