Deep Q-Network (DQN) opened the door to deep reinforcement learning (DRL) by combining deep learning (DL) with reinforcement learning (RL). DQN observed that the distribution of the acquired data changes during training, and that this property can destabilize training, so it proposed effective methods to handle this downside. Instead of focusing on the unfavourable aspects, we find it critical for RL to narrow the gap between the estimated data distribution and the ground-truth data distribution, something supervised learning (SL) fails to do. From this new perspective, we extend the basic paradigm of RL, Generalized Policy Iteration (GPI), into a more general version, which we call Generalized Data Distribution Iteration (GDI). Numerous RL algorithms and techniques can be unified under the GDI paradigm, of which GPI can be considered a special case. We provide theoretical proof of why GDI outperforms GPI and how it works. Several practical algorithms based on GDI have been proposed to verify its effectiveness and generality. Empirical experiments demonstrate our state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), where our algorithm achieves a 9620.98% mean human normalized score (HNS), a 1146.39% median HNS, and 22 human world record breakthroughs (HWRB) using only 200M training frames. Our work aims to lead RL research into the journey of conquering human world records and seeking truly superhuman agents in both performance and efficiency.