Although distributional reinforcement learning (DRL) has been widely studied in recent years, two questions remain open: how to ensure the validity of the learned quantile function, and how to make efficient use of the distributional information. This paper offers new perspectives on both questions to encourage further in-depth study. We first propose a non-decreasing quantile function network (NDQFN) that guarantees the monotonicity of the obtained quantile estimates. We then design a general exploration framework for DRL, called distributional prediction error (DPE), that uses the entire distribution of the quantile function. We discuss the theoretical necessity of our method and demonstrate its practical performance gain by comparing it with competing methods on Atari 2600 games, especially hard-exploration games.
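To make the monotonicity requirement concrete, the following minimal sketch shows one generic way to force quantile estimates to be non-decreasing: turn unconstrained scores into positive increments and accumulate them. The abstract does not describe the NDQFN architecture, so this is an illustration under assumptions and not the authors' network; the function name `monotone_quantiles` and the parameters `raw_scores`, `q_min`, and `q_max` are hypothetical.

```python
import numpy as np


def monotone_quantiles(raw_scores, q_min, q_max):
    """Map unconstrained scores to non-decreasing quantile estimates.

    Hypothetical helper for illustration only; not the NDQFN architecture.
    """
    raw_scores = np.asarray(raw_scores, dtype=float)
    # Softmax turns the scores into non-negative increments that sum to 1.
    shifted = raw_scores - raw_scores.max()  # subtract max for numerical stability
    increments = np.exp(shifted) / np.exp(shifted).sum()
    # Cumulative sums of non-negative increments cannot decrease.
    cumulative = np.cumsum(increments)
    # Rescale so the estimates lie in (q_min, q_max].
    return q_min + (q_max - q_min) * cumulative


if __name__ == "__main__":
    # Arbitrary scores, e.g. the raw outputs of a network head (assumed).
    estimates = monotone_quantiles([0.3, -1.2, 2.0, 0.1], q_min=-1.0, q_max=5.0)
    print(estimates)
    print("non-decreasing:", bool(np.all(np.diff(estimates) >= 0)))
```

Because every increment is non-negative, the ordering holds for any input scores, so no sorting or penalty term is needed to avoid crossing quantiles. In this sketch the range `[q_min, q_max]` is passed in as fixed numbers; in a learned model it would presumably be predicted as well.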