Obtaining higher sample efficiency and superior final performance simultaneously has been one of the major challenges for deep reinforcement learning (DRL). Previous work could handle one of these challenges but typically failed to address them concurrently. In this paper, we tackle both challenges at once. To achieve this, we first decouple them into two classic RL problems: data richness and the exploration-exploitation trade-off. We then cast both problems as a training data distribution optimization problem, namely obtaining the desired training data within limited interactions, and address them concurrently via i) explicit modeling and control of the capacity and diversity of the behavior policy and ii) more fine-grained and adaptive control of the selective/sampling distribution over the behavior policy through a monotonic data distribution optimization. Finally, we integrate this process into Generalized Policy Iteration (GPI) and obtain a more general framework called Generalized Data Distribution Iteration (GDI). We use the GDI framework to introduce operator-based versions of well-known RL methods from DQN to Agent57. We provide a theoretical guarantee of the superiority of GDI over GPI. We also demonstrate state-of-the-art (SOTA) performance on the Arcade Learning Environment (ALE), where our algorithm achieves a 9620.33% mean human normalized score (HNS), a 1146.39% median HNS, and surpasses 22 human world records using only 200M training frames. Our performance is comparable to Agent57's while consuming 500 times less data. We argue that there is still a long way to go before obtaining real superhuman agents in ALE.
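To make the data-distribution-iteration idea concrete, here is a minimal, self-contained Python sketch on a toy bandit. It assumes an epsilon-greedy behavior policy family; every name (`EPSILONS`, `p_lambda`, the softmax update with temperature `ETA`) is an illustrative placeholder, not the paper's actual method. The outer softmax reweighting stands in for the monotonic optimization of the selective/sampling distribution over the behavior policy family, while the inner Q update plays the role of GPI's policy evaluation and improvement step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bandit environment: pulling arm a yields a noisy reward.
TRUE_MEANS = np.array([0.1, 0.5, 0.8])

def pull(arm):
    return TRUE_MEANS[arm] + rng.normal(0.0, 0.1)

# A family of epsilon-greedy behavior policies, indexed by epsilon
# (a stand-in for explicitly modeling the capacity/diversity of the
# behavior policy).
EPSILONS = np.array([0.01, 0.1, 0.3, 0.6])

q = np.zeros(len(TRUE_MEANS))          # value estimates (evaluation target)
counts = np.zeros(len(TRUE_MEANS))
p_lambda = np.ones(len(EPSILONS)) / len(EPSILONS)  # sampling distribution over the family
lambda_returns = np.zeros(len(EPSILONS))           # running return per family member
ETA = 2.0                                          # temperature of the distribution update

for step in range(5000):
    # --- Data collection: sample one behavior policy from the family ---
    k = rng.choice(len(EPSILONS), p=p_lambda)
    eps = EPSILONS[k]
    arm = rng.integers(len(q)) if rng.random() < eps else int(np.argmax(q))
    r = pull(arm)

    # --- GPI-style step: incremental policy evaluation; acting greedily
    #     w.r.t. q above is the implicit policy improvement ---
    counts[arm] += 1
    q[arm] += (r - q[arm]) / counts[arm]

    # --- Data distribution optimization: shift p_lambda toward family
    #     members whose data yields higher return (a softmax proxy for
    #     the monotonic update described in the abstract) ---
    lambda_returns[k] += 0.05 * (r - lambda_returns[k])
    p_lambda = np.exp(ETA * lambda_returns)
    p_lambda /= p_lambda.sum()

print("value estimates:", np.round(q, 2))
print("final sampling distribution over epsilons:", np.round(p_lambda, 2))
```

Running this sketch, the sampling distribution concentrates on the epsilon values that produce the most useful data for the current stage of learning, which is the intuition behind iterating on the data distribution jointly with the policy.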