This work identifies a common flaw of deep reinforcement learning (RL) algorithms: a tendency to rely on early interactions and ignore useful evidence encountered later. Because of training on progressively growing datasets, deep RL agents incur a risk of overfitting to earlier experiences, negatively affecting the rest of the learning process. Inspired by cognitive science, we refer to this effect as the primacy bias. Through a series of experiments, we dissect the algorithmic aspects of deep RL that exacerbate this bias. We then propose a simple yet generally-applicable mechanism that tackles the primacy bias by periodically resetting a part of the agent. We apply this mechanism to algorithms in both discrete (Atari 100k) and continuous action (DeepMind Control Suite) domains, consistently improving their performance.
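The reset mechanism described above amounts to re-initializing part of the agent's network at fixed intervals while keeping the rest of the agent (including its replay buffer) intact. Below is a minimal PyTorch sketch of that idea under illustrative assumptions: the network sizes, the reset interval, and the choice to reset only the final layers are placeholders, not the paper's exact configuration.

```python
# Minimal sketch of periodic resets to counter the primacy bias.
# All hyperparameters and the architecture are illustrative assumptions.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        # Early layers are kept across resets in this sketch.
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        # Final layers ("a part of the agent") are periodically re-initialized.
        self.head = nn.Sequential(
            nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, n_actions)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(obs))

    def reset_head(self) -> None:
        # Re-initialize only the head, leaving the encoder untouched.
        for layer in self.head:
            if isinstance(layer, nn.Linear):
                layer.reset_parameters()


def train(num_updates: int = 100_000, reset_interval: int = 20_000) -> None:
    q = QNetwork(obs_dim=8, n_actions=4)
    optimizer = torch.optim.Adam(q.parameters(), lr=3e-4)
    for step in range(1, num_updates + 1):
        # ... sample a batch from the replay buffer and take a gradient step ...
        if step % reset_interval == 0:
            # Periodic reset: the agent relearns from the full replay buffer,
            # rather than staying anchored to its earliest experiences.
            q.reset_head()
            optimizer = torch.optim.Adam(q.parameters(), lr=3e-4)
```

Because the replay buffer is preserved, the re-initialized layers can quickly recover by training on all data collected so far, which is why resets can discard early overfitting without discarding the evidence gathered later.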