The successes of deep Reinforcement Learning (RL) are largely limited to settings with a large stream of online experience; applying RL in the data-efficient setting, with only limited access to online interaction, remains challenging. A key to data-efficient RL is good value estimation, but current methods in this space fail to fully utilise the structure of the trajectory data gathered from the environment. In this paper, we treat the transition data of the MDP as a graph and define a novel backup operator, Graph Backup, which exploits this graph structure for better value estimation. Compared to multi-step backup methods such as $n$-step $Q$-Learning and TD($\lambda$), Graph Backup can perform counterfactual credit assignment and gives stable value estimates for a state regardless of which trajectory the state was sampled from. Our method, when combined with popular value-based methods, improves performance over one-step and multi-step baselines on a suite of data-efficient RL benchmarks, including MiniGrid, MinAtar and Atari100K. We further analyse the reasons for this performance boost through a novel visualisation of the transition graphs of Atari games.
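To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of what backing up values over a pooled transition graph rather than along individual trajectories can look like. It assumes discrete, hashable states; the class `TransitionGraph` and the function `graph_backup` are illustrative names introduced here, not from the paper.

```python
# Illustrative sketch: pool replay transitions into an empirical MDP graph keyed
# by state, then back up values over the whole graph. Because every state
# aggregates all transitions observed from it, its value estimate does not
# depend on which single trajectory it was sampled from.
from collections import defaultdict


class TransitionGraph:
    """Empirical transition graph built from observed (s, a, r, s', done) tuples."""

    def __init__(self):
        # edges[s][a] is a list of observed (reward, next_state, done) outcomes
        self.edges = defaultdict(lambda: defaultdict(list))

    def add(self, s, a, r, s_next, done):
        self.edges[s][a].append((r, s_next, done))


def graph_backup(graph, values, gamma=0.99, sweeps=10):
    """Sweep backups over the graph, Q-learning style (greedy over actions)."""
    for _ in range(sweeps):
        for s, actions in graph.edges.items():
            q_values = []
            for a, outcomes in actions.items():
                # Empirical expectation over all observed outcomes of (s, a).
                q = sum(r + gamma * (0.0 if done else values.get(s2, 0.0))
                        for r, s2, done in outcomes) / len(outcomes)
                q_values.append(q)
            values[s] = max(q_values)
    return values


# Usage: two trajectories that pass through the same state share credit,
# so s0 benefits from the reward observed only on trajectory A.
g = TransitionGraph()
g.add("s0", "right", 0.0, "s1", False)
g.add("s1", "right", 1.0, "goal", True)   # trajectory A reaches the goal
g.add("s1", "left", 0.0, "s0", False)     # trajectory B wanders back
print(graph_backup(g, values={}))
```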