Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the whole distribution of the total return rather than only its expectation. Despite the remarkable performance of distributional RL, a theoretical understanding of its advantages over expectation-based RL remains elusive. In this paper, we interpret distributional RL as entropy-regularized maximum likelihood estimation in the \textit{neural Z-fitted iteration} framework, and establish the connection of the resulting risk-aware regularization with maximum entropy RL. In addition, we shed light on the stability-promoting distributional loss in distributional RL, whose desirable smoothness properties yield stable optimization and generalization guarantees. We also analyze the acceleration behavior in optimizing distributional RL algorithms and show that an appropriate approximation of the true target distribution can speed up convergence. From the representation perspective, we find that distributional RL encourages the representations of states assigned to the same action by the policy to form tighter clusters. Finally, we propose a class of \textit{Sinkhorn distributional RL} algorithms that interpolate between the Wasserstein distance and maximum mean discrepancy~(MMD). Experiments on a suite of Atari games demonstrate the competitive performance of our algorithm relative to existing state-of-the-art distributional RL algorithms.
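As a brief sketch of the interpolation mentioned above, the (debiased) Sinkhorn divergence between two return distributions $\mu$ and $\nu$ can be written in the standard entropy-regularized optimal-transport form, where the cost function $c$ and the regularization strength $\varepsilon$ are placeholders for the choices made in the algorithm:
\begin{align*}
\mathcal{W}_{c,\varepsilon}(\mu,\nu) &= \min_{\pi \in \Pi(\mu,\nu)} \int c(x,y)\, \mathrm{d}\pi(x,y) + \varepsilon\, \mathrm{KL}\!\left(\pi \,\|\, \mu \otimes \nu\right), \\
\overline{\mathcal{W}}_{c,\varepsilon}(\mu,\nu) &= \mathcal{W}_{c,\varepsilon}(\mu,\nu) - \tfrac{1}{2}\,\mathcal{W}_{c,\varepsilon}(\mu,\mu) - \tfrac{1}{2}\,\mathcal{W}_{c,\varepsilon}(\nu,\nu).
\end{align*}
As $\varepsilon \to 0$, the divergence recovers the optimal-transport (Wasserstein) cost with ground cost $c$, while as $\varepsilon \to \infty$ it approaches an MMD with kernel $k=-c$, which is the sense in which the Sinkhorn loss interpolates between the two.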