GDI: Rethinking What Makes Reinforcement Learning Different From Supervised Learning

Deep Q Network (DQN) opened the door to deep reinforcement learning (DRL) by combining deep learning (DL) with reinforcement learning (RL), and in doing so observed that the distribution of the acquired data changes during training. DQN found that this property can destabilize training, and proposed effective methods to mitigate it. Instead of focusing on this downside, we argue that it is critical for RL to narrow the gap between the estimated data distribution and the ground-truth data distribution, whereas supervised learning (SL) fails to do so. From this perspective, we extend the basic paradigm of RL, Generalized Policy Iteration (GPI), into a more general version called Generalized Data Distribution Iteration (GDI). We show that many RL algorithms and techniques can be unified under the GDI paradigm, each of which can be regarded as a special case of GDI. We provide a theoretical proof of why GDI is better than GPI and an explanation of how it works. We propose several practical algorithms based on GDI to verify its effectiveness and generality. Experiments demonstrate state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), where our algorithm achieves a 9620.98% mean human normalized score (HNS), a 1146.39% median HNS, and 22 human world record breakthroughs (HWRB) using only 200M training frames. Our work aims to lead RL research toward surpassing human world records and toward agents that are truly superhuman in both performance and efficiency.
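
The abstract reports both a mean and a median HNS, and the two differ by almost an order of magnitude. The sketch below shows how these aggregates are commonly computed on ALE, assuming the usual definition HNS = (agent - random) / (human - random) expressed as a percentage. The game names and raw scores are hypothetical placeholders chosen for illustration, not results from the paper, and the helper names are not taken from the authors' code.

```python
from statistics import mean, median


def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Return HNS in percent: 0% is random play, 100% is the human baseline."""
    return 100.0 * (agent - random) / (human - random)


def aggregate_hns(per_game: dict) -> tuple:
    """Compute (mean HNS, median HNS) over a dict of per-game raw scores."""
    hns = [
        human_normalized_score(s["agent"], s["random"], s["human"])
        for s in per_game.values()
    ]
    return mean(hns), median(hns)


# Hypothetical raw scores for three made-up games (illustration only).
example_scores = {
    "game_a": {"agent": 110.0, "random": 10.0, "human": 60.0},
    "game_b": {"agent": 30.0, "random": 10.0, "human": 50.0},
    "game_c": {"agent": 1010.0, "random": 10.0, "human": 110.0},
}

mean_hns, median_hns = aggregate_hns(example_scores)
print(f"mean HNS = {mean_hns:.2f}%, median HNS = {median_hns:.2f}%")
```

In this toy data a single game with a very high normalized score pulls the mean well above the median, which is one reason the two figures are usually reported together.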
