Distributional reinforcement learning (RL) advocates representing the complete probability distribution of the random return instead of modelling only its expectation. A distributional RL algorithm may be characterised by two main components: the representation of the distribution together with its parameterisation, and the probability metric defining the loss. This work considers the unconstrained monotonic neural network (UMNN) architecture, a universal approximator of continuous monotonic functions that is particularly well suited to modelling different representations of a distribution. This property makes it possible to decouple the effect of the function approximator class from that of the probability metric. The paper first introduces a methodology for learning different representations of the random return distribution: the probability density function (PDF), the cumulative distribution function (CDF) and the quantile function (QF). Second, a novel distributional RL algorithm named unconstrained monotonic deep Q-network (UMDQN) is presented. To the authors' knowledge, it is the first distributional RL method supporting the learning of three valid and continuous representations of the random return distribution. Lastly, in light of this new algorithm, an empirical comparison is performed between three probability quasi-metrics, namely the Kullback-Leibler divergence, the Cramér distance and the Wasserstein distance. The results highlight the main strengths and weaknesses associated with each probability metric, together with an important limitation of the Wasserstein distance.
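To make the three quasi-metrics concrete, the following minimal sketch evaluates each of them for two discrete distributions defined on a shared, uniformly spaced support. It is an illustration under stated assumptions, not the UMDQN loss: the function names, the `eps` floor used in the Kullback-Leibler term and the grid spacing `dz` are hypothetical choices made here, and the Cramér distance is taken as the square root of the integrated squared CDF difference (some authors use the squared form).

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors on the same support.

    Assumption: q is floored at eps to avoid division by zero, so the
    result is large but finite where q has no mass and p does.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))


def cramer_distance(p, q, dz=1.0):
    """L2 distance between the two CDFs (Riemann sum with spacing dz)."""
    diff = np.cumsum(p) - np.cumsum(q)
    return float(np.sqrt(np.sum(diff ** 2) * dz))


def wasserstein1_distance(p, q, dz=1.0):
    """L1 distance between the two CDFs, i.e. the 1-Wasserstein distance in 1D."""
    diff = np.cumsum(p) - np.cumsum(q)
    return float(np.sum(np.abs(diff)) * dz)


# Two illustrative return distributions over the support {0, 1, 2, 3}.
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

kl = kl_divergence(p, q)
cramer = cramer_distance(p, q, dz=1.0)
w1 = wasserstein1_distance(p, q, dz=1.0)
print(f"KL={kl:.4f}  Cramer={cramer:.4f}  Wasserstein-1={w1:.4f}")
```

The Kullback-Leibler divergence compares probability masses pointwise, whereas the Cramér and Wasserstein distances compare CDFs and are therefore sensitive to how far apart the two distributions lie along the return axis.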