Most deep reinforcement learning (RL) algorithms distill experience into parametric behavior policies or value functions via gradient updates. While effective, this approach has several disadvantages: (1) it is computationally expensive, (2) it can take many updates to integrate experiences into the parametric model, (3) experiences that are not fully integrated do not appropriately influence the agent's behavior, and (4) behavior is limited by the capacity of the model. In this paper we explore an alternative paradigm in which we train a network to map a dataset of past experiences to optimal behavior. Specifically, we augment an RL agent with a retrieval process (parameterized as a neural network) that has direct access to a dataset of experiences. This dataset can come from the agent's past experiences, expert demonstrations, or any other relevant source. The retrieval process is trained to retrieve information from the dataset that may be useful in the current context, to help the agent achieve its goal faster and more efficiently. The proposed method facilitates learning agents that at test-time can condition their behavior on the entire dataset and not only the current state or current trajectory. We integrate our method into two different RL agents: an offline DQN agent and an online R2D2 agent. In offline multi-task problems, we show that the retrieval-augmented DQN agent avoids task interference and learns faster than the baseline DQN agent. On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores. We run extensive ablations to measure the contributions of the components of our proposed method.
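To make the retrieval-augmented setup concrete, the sketch below shows one plausible way an agent could condition its action values on both the current state and a summary of retrieved experiences: a learned query encoder scores the stored experiences, the top-k matches are softly aggregated, and the aggregate is concatenated with the state embedding before the Q-head. This is a minimal illustration under our own assumptions, not the paper's architecture; names such as `RetrievalAugmentedQNetwork`, the encoders, and the scoring scheme are all hypothetical.

```python
# Minimal sketch (assumed, not the paper's implementation) of a Q-network that
# retrieves from a dataset of experiences and conditions its output on them.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalAugmentedQNetwork(nn.Module):
    def __init__(self, state_dim, num_actions, embed_dim=64, k=8):
        super().__init__()
        self.k = k
        self.query_encoder = nn.Linear(state_dim, embed_dim)  # encodes current state
        self.key_encoder = nn.Linear(state_dim, embed_dim)    # encodes stored experiences
        self.q_head = nn.Sequential(
            nn.Linear(2 * embed_dim, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, state, dataset_states):
        # state: [B, state_dim]; dataset_states: [N, state_dim]
        query = self.query_encoder(state)                      # [B, D]
        keys = self.key_encoder(dataset_states)                # [N, D]
        scores = query @ keys.t() / keys.shape[-1] ** 0.5      # [B, N] similarity scores
        top_scores, top_idx = scores.topk(self.k, dim=-1)      # k most relevant experiences
        weights = F.softmax(top_scores, dim=-1)                # [B, k] soft retrieval weights
        retrieved = keys[top_idx]                              # [B, k, D] retrieved embeddings
        summary = (weights.unsqueeze(-1) * retrieved).sum(1)   # [B, D] weighted summary
        return self.q_head(torch.cat([query, summary], dim=-1))


if __name__ == "__main__":
    net = RetrievalAugmentedQNetwork(state_dim=16, num_actions=4)
    q_values = net(torch.randn(2, 16), torch.randn(100, 16))
    print(q_values.shape)  # torch.Size([2, 4])
```

Because the retrieval weights are differentiable, such a module can be trained end-to-end with the agent's TD loss, which matches the abstract's framing of a retrieval process learned to surface context-relevant experiences.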