While the maximum entropy (MaxEnt) reinforcement learning (RL) framework -- often touted for its exploration and robustness capabilities -- is usually motivated from a probabilistic perspective, the use of deep probabilistic models has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution and, additionally, naturally emerge under the use of world models with a latent belief state. We discuss why latent variable policies are difficult to train and how naive approaches can fail, then introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method under the actor-critic framework, marginalizing both the actor and the critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training. Our implementation is open-sourced at https://github.com/zdhNarsil/Stochastic-Marginal-Actor-Critic.
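To make the central idea concrete, the sketch below shows one way a latent variable policy and its marginal log-density can be estimated by Monte Carlo: an action is drawn by first sampling a latent variable z from p(z|s) and then an action from pi(a|s,z), and log pi(a|s) = log E_{z~p(z|s)}[pi(a|s,z)] is approximated with K latent samples via a log-sum-exp. This is a minimal, hypothetical illustration of latent-state marginalization under simple diagonal-Gaussian assumptions; the class, layer sizes, and sample count K are placeholders and not the paper's SMAC implementation.

```python
import math

import torch
import torch.nn as nn
from torch.distributions import Independent, Normal


class LatentVariablePolicy(nn.Module):
    """Illustrative latent variable policy pi(a|s) = E_{z~p(z|s)}[pi(a|s,z)]."""

    def __init__(self, state_dim, latent_dim, action_dim, hidden=64):
        super().__init__()
        # p(z|s): diagonal Gaussian over the latent variable.
        self.prior = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * latent_dim)
        )
        # pi(a|s,z): diagonal Gaussian over the action, conditioned on (s, z).
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.Tanh(), nn.Linear(hidden, 2 * action_dim)
        )

    @staticmethod
    def _gaussian(params):
        mean, log_std = params.chunk(2, dim=-1)
        return Independent(Normal(mean, log_std.clamp(-5.0, 2.0).exp()), 1)

    def sample(self, state):
        """Sample an action by first sampling the latent, then the action."""
        z = self._gaussian(self.prior(state)).rsample()
        return self._gaussian(self.decoder(torch.cat([state, z], dim=-1))).rsample()

    def marginal_log_prob(self, state, action, K=8):
        """Monte Carlo estimate of log pi(a|s) with K latent samples."""
        z = self._gaussian(self.prior(state)).rsample((K,))        # (K, B, latent_dim)
        state_K = state.unsqueeze(0).expand(K, *state.shape)       # (K, B, state_dim)
        action_K = action.unsqueeze(0).expand(K, *action.shape)    # (K, B, action_dim)
        log_p = self._gaussian(self.decoder(torch.cat([state_K, z], dim=-1))).log_prob(action_K)
        # log (1/K) sum_k pi(a|s, z_k), computed stably in log space.
        return torch.logsumexp(log_p, dim=0) - math.log(K)
```

Under these assumptions, the marginal log-probability estimate could plug into a MaxEnt-style entropy bonus or a soft actor-critic objective in place of the usual single-Gaussian log-density; increasing K trades extra compute for a tighter estimate of the marginal.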