The distributional reinforcement learning (RL) approach advocates representing the complete probability distribution of the random return instead of modelling only its expectation. A distributional RL algorithm may be characterised by two main components: the representation and parameterisation of the distribution, and the probability metric defining the loss. This research considers the unconstrained monotonic neural network (UMNN) architecture, a universal approximator of continuous monotonic functions that is particularly well suited for modelling the different representations of a distribution (PDF, CDF, quantile function). This property makes it possible to decouple the effect of the function approximator class from that of the probability metric. The paper first introduces a methodology for learning the different representations of the random return distribution. Second, a novel distributional RL algorithm named unconstrained monotonic deep Q-network (UMDQN) is presented. Last, in light of this new algorithm, an empirical comparison is performed between three probability quasi-metrics, namely the Kullback-Leibler divergence, the Cramer distance, and the Wasserstein distance. The results call for a reconsideration of all probability metrics in distributional RL, which contrasts with the dominance of the Wasserstein distance in recent publications.
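To make the UMNN idea concrete, the sketch below shows how a function that is monotone in its input can be obtained by numerically integrating a strictly positive "derivative" network, which is the property that makes the architecture suitable for parameterising a CDF or quantile function of the random return. This is a minimal PyTorch sketch under our own assumptions, not the authors' implementation: the class name MonotonicNet, the embedding dimension, the plain Riemann quadrature, and the sigmoid squashing are illustrative choices (the original UMNN work uses Clenshaw-Curtis quadrature).

import torch
import torch.nn as nn

class MonotonicNet(nn.Module):
    """f(x; h) = offset(h) + integral_0^x g(t, h) dt with g > 0, hence monotone in x."""

    def __init__(self, embedding_dim=32, hidden=64, n_steps=50):
        super().__init__()
        self.n_steps = n_steps
        # Derivative network: maps (t, embedding) to a strictly positive scalar.
        self.derivative = nn.Sequential(
            nn.Linear(1 + embedding_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),
        )
        # Value of f at x = 0, conditioned on the embedding.
        self.offset = nn.Linear(embedding_dim, 1)

    def forward(self, x, h):
        # x: (batch, 1) integration upper bounds; h: (batch, embedding_dim) embeddings.
        # Plain Riemann quadrature of the integral over [0, x].
        steps = torch.linspace(0.0, 1.0, self.n_steps, device=x.device)
        t = x * steps                                        # (batch, n_steps)
        h_rep = h.unsqueeze(1).expand(-1, self.n_steps, -1)  # (batch, n_steps, emb)
        g = self.derivative(torch.cat([t.unsqueeze(-1), h_rep], dim=-1)).squeeze(-1)
        return self.offset(h) + g.mean(dim=1, keepdim=True) * x

# Example: model the CDF of the return for a batch of state-action embeddings.
net = MonotonicNet()
h = torch.randn(8, 32)               # embeddings from a separate state-action encoder
z = torch.linspace(-5.0, 5.0, 8).unsqueeze(1)
cdf = torch.sigmoid(net(z, h))       # non-decreasing in z by construction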
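The three probability (quasi-)metrics compared in the study can likewise be illustrated for discrete distributions on a shared support, using the fact that the Cramer and 1-Wasserstein distances reduce to L2 and L1 norms of the difference between the two CDFs. The sketch below is illustrative only; the function names and the uniform-grid discretisation are our assumptions, not code from the paper.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Kullback-Leibler divergence KL(p || q); asymmetric, hence only a quasi-metric.
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def cramer_distance(p, q, dz):
    # Cramer distance: L2 norm of the difference between the two CDFs.
    return float(np.sqrt(np.sum((np.cumsum(p) - np.cumsum(q)) ** 2) * dz))

def wasserstein_distance(p, q, dz):
    # 1-Wasserstein distance: L1 norm of the difference between the two CDFs.
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dz)

z = np.linspace(-10.0, 10.0, 101)    # shared support of the two return distributions
dz = z[1] - z[0]
p = np.exp(-0.5 * (z - 1.0) ** 2); p /= p.sum()
q = np.exp(-0.5 * (z + 1.0) ** 2); q /= q.sum()
print(kl_divergence(p, q), cramer_distance(p, q, dz), wasserstein_distance(p, q, dz))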