Effective decision making involves flexibly relating past experiences and relevant contextual information to a novel situation. In deep reinforcement learning (RL), the dominant paradigm is for an agent to amortise information that helps decision making into its network weights via gradient descent on training losses. Here, we pursue an alternative approach in which agents can utilise large-scale context-sensitive database lookups to support their parametric computations. This allows agents to learn, in an end-to-end manner, to directly use relevant information to inform their outputs. In addition, new information can be attended to by the agent, without retraining, simply by augmenting the retrieval dataset. We study this approach for offline RL in 9x9 Go, a challenging game for which the vast combinatorial state space privileges generalisation over direct matching to past experiences. We leverage fast, approximate nearest-neighbour techniques to retrieve relevant data from a set of tens of millions of expert demonstration states. Attending to this information provides a significant boost to prediction accuracy and game-play performance over simply using these demonstrations as training trajectories, providing a compelling demonstration of the value of large-scale retrieval in offline RL agents.
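To make the retrieval mechanism concrete, the sketch below shows one way an approximate nearest-neighbour index could be built over expert demonstration state embeddings and queried at decision time. This is a minimal illustration only, assuming FAISS as the ANN library; the dimensionality, index type, and names such as `embed dimension`, `k_neighbours`, and `retrieve` are hypothetical and not taken from the paper.

```python
import numpy as np
import faiss  # approximate nearest-neighbour library (an assumption; any ANN library would do)

d = 256            # dimensionality of the state embeddings (hypothetical)
k_neighbours = 16  # number of expert states retrieved per query (hypothetical)

# Embeddings of expert demonstration states; in the paper's setting this would
# hold tens of millions of rows rather than the toy 100k used here.
expert_embeddings = np.random.rand(100_000, d).astype("float32")

# Build an inverted-file index for fast approximate lookups.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024)  # 1024 coarse clusters
index.train(expert_embeddings)
index.add(expert_embeddings)
index.nprobe = 8  # trade retrieval accuracy for query speed

def retrieve(query_embedding: np.ndarray) -> np.ndarray:
    """Return indices of the k nearest expert states for one query embedding."""
    _, neighbour_ids = index.search(query_embedding[None, :], k_neighbours)
    return neighbour_ids[0]

# The agent's network would then attend over the features of the retrieved
# expert states alongside its own parametric computation of the current position.
```

In this framing, enlarging or refreshing `expert_embeddings` and re-adding it to the index is all that is needed for the agent to attend to new information without retraining its weights.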