Distributional reinforcement learning (RL) aims to learn a value network that predicts the full distribution of returns for a given state, often modeled via a quantile-based critic. This approach has been successfully integrated into common RL methods for continuous control, giving rise to algorithms such as Distributional Soft Actor-Critic (DSAC). In this paper, we introduce multi-sample target values (MTV) for distributional RL, as a principled replacement for the single-sample target value estimation commonly employed in current practice. The improved distributional estimates further lend themselves to UCB-based exploration. These two ideas are combined to yield our distributional RL algorithm, E2DC (Extra Exploration with Distributional Critics). We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult tasks such as Humanoid control. We provide further insight into the method via visualization and analysis of the learned distributions and their evolution during training.
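To make the two ideas concrete, the sketch below illustrates (i) a multi-sample target in the spirit of MTV, where the target quantiles are averaged over several sampled next actions instead of being computed from a single sample, and (ii) a UCB-style action score built from the mean and spread of the predicted return quantiles. This is a minimal illustration under assumed interfaces: `critic(obs, action)` returning a `(batch, N)` tensor of quantiles and `policy(obs)` returning a sampleable distribution are hypothetical placeholders, and the paper's exact MTV and exploration constructions may differ.

```python
import torch


def quantile_huber_loss(pred, target, taus, kappa=1.0):
    """Standard quantile-regression Huber loss (as in QR-DQN-style critics).

    pred:   (batch, N)  predicted quantiles
    target: (batch, N') target quantiles
    taus:   (N,)        quantile fractions of the predictions
    """
    td = target.unsqueeze(1) - pred.unsqueeze(2)                # (batch, N, N')
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    weight = (taus.view(1, -1, 1) - (td.detach() < 0).float()).abs()
    return (weight * huber / kappa).mean()


def multi_sample_target(critic, policy, next_obs, reward, done,
                        gamma=0.99, num_action_samples=4):
    """Hypothetical MTV-style target: average target quantiles over several
    sampled next actions rather than a single one."""
    with torch.no_grad():
        samples = []
        for _ in range(num_action_samples):
            next_action = policy(next_obs).sample()             # stochastic policy
            samples.append(critic(next_obs, next_action))       # (batch, N)
        z_avg = torch.stack(samples, dim=0).mean(dim=0)         # average over samples
        return reward.unsqueeze(-1) + gamma * (1.0 - done.unsqueeze(-1)) * z_avg


def ucb_score(critic, obs, action, beta=1.0):
    """Hypothetical UCB-style score: mean predicted return plus a bonus
    proportional to the spread of the quantile estimates."""
    with torch.no_grad():
        z = critic(obs, action)                                  # (batch, N)
        return z.mean(dim=-1) + beta * z.std(dim=-1)             # (batch,)
```

The intended reading is that averaging the target distribution over multiple samples reduces the variance of the bootstrap target, while the quantile spread provides an uncertainty signal that an optimism-based (UCB) exploration rule can exploit; the scaling coefficient `beta` here is an assumed hyperparameter.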