Exploration remains a central challenge for reinforcement learning (RL). Virtually all existing methods share the feature of a monolithic behaviour policy that changes only gradually (at best). In contrast, the exploratory behaviours of animals and humans exhibit a rich diversity, namely including forms of switching between modes. This paper presents an initial study of mode-switching, non-monolithic exploration for RL. We investigate different modes to switch between, at what timescales it makes sense to switch, and what signals make for good switching triggers. We also propose practical algorithmic components that make the switching mechanism adaptive and robust, which enables flexibility without an accompanying hyper-parameter-tuning burden. Finally, we report a promising and detailed analysis on Atari, using two-mode exploration and switching at sub-episodic time-scales.
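To make the idea of two-mode, non-monolithic exploration concrete, the sketch below switches between an "exploit" mode (greedy with respect to action values) and an "explore" mode (uniform-random actions) at sub-episodic timescales. This is a minimal illustration under simplifying assumptions, not the paper's exact algorithm: the class name `TwoModeExplorer`, the geometric mode durations, and the specific modes are all illustrative choices.

```python
import random


class TwoModeExplorer:
    """Illustrative sketch of two-mode, non-monolithic exploration.

    Alternates between an 'exploit' mode (greedy w.r.t. Q-values) and an
    'explore' mode (uniform-random actions). The switching trigger here, a
    randomly drawn mode duration, is a simplifying assumption; the paper
    studies richer, adaptive trigger signals.
    """

    def __init__(self, n_actions, mean_explore_len=10, mean_exploit_len=90, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)
        self.mean_len = {"explore": mean_explore_len, "exploit": mean_exploit_len}
        self.mode = "exploit"
        self.steps_left = self._draw_duration()

    def _draw_duration(self):
        # Geometric durations yield memoryless, sub-episodic switch points
        # with the desired mean length per mode.
        p = 1.0 / self.mean_len[self.mode]
        k = 1
        while self.rng.random() > p:
            k += 1
        return k

    def act(self, q_values):
        # Trigger: switch modes once the current mode's duration runs out.
        if self.steps_left == 0:
            self.mode = "explore" if self.mode == "exploit" else "exploit"
            self.steps_left = self._draw_duration()
        self.steps_left -= 1
        if self.mode == "explore":
            return self.rng.randrange(self.n_actions)
        # Exploit: greedy action under the current value estimates.
        return max(range(self.n_actions), key=lambda a: q_values[a])
```

With a long mean exploit duration and a short mean explore duration, behaviour is mostly greedy but is punctuated by brief random-action bursts within an episode, which is the basic shape of the two-mode, sub-episodic switching the abstract describes.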