Advances in reinforcement learning (RL) often rely on massive compute resources and remain notoriously sample-inefficient. In contrast, the human brain is able to efficiently learn effective control strategies using limited resources. This raises the question of whether insights from neuroscience can be used to improve current RL methods. Predictive processing is a popular theoretical framework which maintains that the human brain is actively seeking to minimize surprise. We show that recurrent neural networks which predict their own sensory states can be leveraged to minimize surprise, yielding substantial gains in cumulative reward. Specifically, we present the Predictive Processing Proximal Policy Optimization (P4O) agent: an actor-critic reinforcement learning agent that applies predictive processing to a recurrent variant of the PPO algorithm by integrating a world model in its hidden state. P4O significantly outperforms a baseline recurrent variant of the PPO algorithm on multiple Atari games using a single GPU. It also outperforms other state-of-the-art agents given the same wall-clock time and exceeds human gamer performance on multiple games, including Seaquest, which is a particularly challenging environment in the Atari domain. Altogether, our work underscores how insights from the field of neuroscience may support the development of more capable and efficient artificial agents.
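To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of a recurrent actor-critic whose hidden state also drives a prediction of the next observation encoding. Minimizing this prediction error ("surprise") would be added as an auxiliary term to the usual clipped PPO policy and value losses; all module names, sizes, and the linear predictor are illustrative assumptions.

```python
# Hypothetical sketch: recurrent actor-critic with an auxiliary "surprise" loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PredictiveRecurrentActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)        # observation encoder
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)         # recurrent core (hidden state)
        self.policy_head = nn.Linear(hidden_dim, n_actions)   # actor
        self.value_head = nn.Linear(hidden_dim, 1)             # critic
        self.predictor = nn.Linear(hidden_dim, hidden_dim)     # predicts next observation encoding

    def forward(self, obs, h):
        z = torch.relu(self.encoder(obs))       # encode current observation
        h = self.rnn(z, h)                       # update recurrent hidden state
        logits = self.policy_head(h)             # action distribution parameters
        value = self.value_head(h).squeeze(-1)   # state-value estimate
        z_pred = self.predictor(h)               # predicted encoding of the next observation
        return logits, value, z_pred, h


def surprise_loss(model, z_pred, next_obs):
    # Prediction error between predicted and actual next-observation encodings.
    with torch.no_grad():
        z_next = torch.relu(model.encoder(next_obs))
    return F.mse_loss(z_pred, z_next)


if __name__ == "__main__":
    obs_dim, n_actions, batch, hidden_dim = 8, 4, 16, 128
    model = PredictiveRecurrentActorCritic(obs_dim, n_actions, hidden_dim)
    h = torch.zeros(batch, hidden_dim)
    obs, next_obs = torch.randn(batch, obs_dim), torch.randn(batch, obs_dim)

    logits, value, z_pred, h = model(obs, h)
    aux = surprise_loss(model, z_pred, next_obs)
    # In training, `aux` would be added (with a weighting coefficient) to the
    # clipped PPO policy and value losses computed from `logits` and `value`.
    print(aux.item())
```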