While the maximum entropy (MaxEnt) reinforcement learning (RL) framework -- often touted for its exploration and robustness capabilities -- is usually motivated from a probabilistic perspective, the use of deep probabilistic models within it has not gained much traction in practice due to their inherent complexity. In this work, we propose the adoption of latent variable policies within the MaxEnt framework, which we show can provably approximate any policy distribution and, moreover, arise naturally when using world models with a latent belief state. We discuss why latent variable policies are difficult to train and how naive approaches can fail, then introduce a series of improvements centered around low-cost marginalization of the latent state, allowing us to make full use of the latent state at minimal additional cost. We instantiate our method within the actor-critic framework, marginalizing both the actor and the critic. The resulting algorithm, referred to as Stochastic Marginal Actor-Critic (SMAC), is simple yet effective. We experimentally validate our method on continuous control tasks, showing that effective marginalization can lead to better exploration and more robust training.
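To make the marginalization idea concrete, the following is a minimal sketch (assuming PyTorch; it is not the authors' implementation) of a latent variable policy pi(a|s) = E_{z ~ p(z|s)}[pi(a|s, z)] whose marginal log-likelihood is estimated by Monte Carlo averaging over sampled latents. The class name, network sizes, and number of latent samples are illustrative assumptions.

# Minimal sketch: Monte Carlo marginalization of a latent variable Gaussian policy.
# log pi(a|s) ~= log (1/K) sum_k pi(a|s, z_k), with z_k ~ p(z|s).
import math
import torch
import torch.nn as nn

class LatentVariablePolicy(nn.Module):
    def __init__(self, state_dim, latent_dim, action_dim, hidden=64):
        super().__init__()
        # State-conditional prior over the latent: p(z|s) = N(mu(s), sigma(s)^2)
        self.prior = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 2 * latent_dim))
        # Conditional action distribution: pi(a|s, z) = N(mu(s, z), sigma(s, z)^2)
        self.head = nn.Sequential(nn.Linear(state_dim + latent_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 2 * action_dim))

    def _gaussian(self, params):
        # Split network output into mean and (clamped) log-std, return a Normal.
        mu, log_std = params.chunk(2, dim=-1)
        return torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())

    def marginal_log_prob(self, state, action, num_samples=8):
        # Sample K latents per state and average the conditional likelihoods.
        p_z = self._gaussian(self.prior(state))
        z = p_z.rsample((num_samples,))                      # (K, B, latent_dim)
        s = state.expand(num_samples, *state.shape)          # (K, B, state_dim)
        p_a = self._gaussian(self.head(torch.cat([s, z], dim=-1)))
        log_probs = p_a.log_prob(action).sum(-1)             # (K, B)
        return torch.logsumexp(log_probs, dim=0) - math.log(num_samples)

policy = LatentVariablePolicy(state_dim=3, latent_dim=4, action_dim=2)
s, a = torch.randn(5, 3), torch.randn(5, 2)
print(policy.marginal_log_prob(s, a).shape)  # torch.Size([5])

In a MaxEnt actor-critic setting, an estimate of this kind would be used wherever the policy's (marginal) log-density or entropy enters the objective; the analogous averaging over latent samples can likewise be applied on the critic side.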