We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts $\epsilon$-greedy exploration in a fresh light, namely that $\epsilon$-noise plays a much smaller role than expected.
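To make the measured quantity concrete, the following is a minimal sketch (our own illustration, not the paper's code) of how policy churn can be quantified: the fraction of a fixed batch of states whose greedy action differs between two snapshots of the Q-function taken a few gradient updates apart.

```python
import numpy as np

def greedy_actions(q_fn, states):
    """Greedy action per state; q_fn maps a batch of states to an
    array of Q-values with shape (num_states, num_actions)."""
    return np.argmax(q_fn(states), axis=-1)

def policy_churn(q_fn_before, q_fn_after, states):
    """Fraction of states whose greedy action changes between the
    two Q-function snapshots (assumed callables; names are ours)."""
    a_before = greedy_actions(q_fn_before, states)
    a_after = greedy_actions(q_fn_after, states)
    return float(np.mean(a_before != a_after))
```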