Distributional Reinforcement Learning (RL) maintains the entire probability distribution of the reward-to-go, i.e. the return, providing richer learning signals that account for the uncertainty associated with policy performance, which may be beneficial for trading off exploration and exploitation and for policy learning in general. Previous works in distributional RL focused mainly on computing state-action-return distributions; here we model state-return distributions instead. This enables us to translate successful conventional RL algorithms that are based on state values into distributional RL. We formulate the distributional Bellman operation as an inference-based auto-encoding process that minimises a Wasserstein metric between target and model return distributions. The proposed algorithm, BDPG (Bayesian Distributional Policy Gradients), uses adversarial training in joint-contrastive learning to estimate a variational posterior from the returns. Moreover, we can now interpret the return-prediction uncertainty as an information gain, which allows us to obtain a new curiosity measure that helps BDPG steer exploration actively and efficiently. We demonstrate on a suite of Atari 2600 games and MuJoCo tasks, including well-known hard-exploration challenges, that BDPG generally learns faster and reaches higher asymptotic performance than reference distributional RL algorithms.
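For orientation, the state-return distributional Bellman recursion alluded to above can be written, in standard distributional-RL notation (a sketch, not necessarily the paper's exact formulation), as
\[
  Z^{\pi}(s) \overset{D}{=} r(s,a) + \gamma\, Z^{\pi}(s'), \qquad a \sim \pi(\cdot \mid s),\; s' \sim P(\cdot \mid s,a),
\]
where $Z^{\pi}(s)$ is the random return from state $s$ under policy $\pi$; a parametric model $Z_{\theta}$ is then fitted by minimising a Wasserstein distance $W_{p}\big(Z_{\theta}(s),\, r(s,a) + \gamma\, Z_{\theta}(s')\big)$ between the modelled and the Bellman-target return distributions.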