Distributional reinforcement learning~(RL) is a class of state-of-the-art algorithms that estimate the whole distribution of the total return rather than only its expectation. Despite the remarkable performance of distributional RL, a theoretical understanding of its advantages over expectation-based RL remains elusive. In this paper, we illuminate the superiority of distributional RL from both regularization and optimization perspectives. First, by applying an expectation decomposition, we interpret the additional impact of distributional RL over expectation-based RL as a \textit{risk-aware entropy regularization} within the \textit{Neural Fitted Z-Iteration} framework. We also provide a rigorous comparison between the resulting entropy regularization and the vanilla one in maximum entropy RL. Through the lens of optimization, we shed light on the distributional loss, whose desirable smoothness properties promote stable optimization in distributional RL. Moreover, we characterize the acceleration effect that distributional RL gains from the risk-aware entropy regularization. Finally, rigorous experiments reveal the distinct regularization effects as well as the mutual impact of vanilla entropy and risk-aware entropy regularization in distributional RL, focusing specifically on actor-critic algorithms. We also empirically verify that distributional RL algorithms enjoy more stable gradient behavior, which contributes to their stable optimization and acceleration compared with classical RL. Our research paves the way toward better interpreting the superiority of distributional RL algorithms.
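To make the contrast between the two learning targets concrete, the following minimal sketch places an expectation-based squared TD loss next to a categorical (C51-style) distributional loss; the networks \texttt{q\_net}, \texttt{target\_q\_net}, and \texttt{z\_net}, as well as the omitted projection of the Bellman target onto the atom support, are illustrative assumptions rather than the implementation analyzed in this paper.
\begin{verbatim}
import torch
import torch.nn.functional as F

# Expectation-based RL: regress only the mean return with a squared TD error.
def expected_td_loss(q_net, target_q_net, s, a, r, s_next, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a)
    with torch.no_grad():
        target = r + gamma * target_q_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)

# Distributional RL (categorical, C51-style): fit the whole return
# distribution with a cross-entropy loss against the projected target.
def categorical_distributional_loss(z_net, target_probs, s, a):
    # z_net(s) -> logits of shape (batch, num_actions, num_atoms)
    logits = z_net(s)[torch.arange(s.shape[0]), a]             # (batch, num_atoms)
    log_probs = F.log_softmax(logits, dim=-1)
    # target_probs: the distributional Bellman target projected onto the
    # fixed atom support (projection step omitted for brevity)
    return -(target_probs * log_probs).sum(dim=-1).mean()
\end{verbatim}
The cross-entropy term fits an entire categorical return distribution rather than a single scalar, which is the contrast underlying the regularization interpretation sketched above.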