In reinforcement learning, the performance of learning agents is highly sensitive to the choice of time discretization. Agents acting at high frequencies have the finest control opportunities, but they also suffer from drawbacks such as inefficient exploration and vanishing action advantages. Action repetition, i.e., action persistence, helps mitigate these issues, as it allows the agent to visit wider regions of the state space and to improve the estimation of the action effects. In this work, we derive a novel All-Persistence Bellman Operator, which allows an effective use of both low-persistence experience, through decomposition into sub-transitions, and high-persistence experience, thanks to the introduction of a suitable bootstrap procedure. In this way, transitions collected at any time scale are employed to simultaneously update the action values for the whole considered set of persistences. We prove the contraction property of the All-Persistence Bellman Operator and, based on it, we extend classic Q-learning and DQN. After studying the effects of persistence, we experimentally evaluate our approach in both tabular settings and more challenging frameworks, including some Atari games.
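To make the idea concrete, the following is a minimal, illustrative sketch (not the paper's exact algorithm) of a tabular Q-learning update with a Q-table indexed by (state, action, persistence), which reuses a single k-persistent transition to refresh every persistence k' in a chosen set: persistences k' ≤ k are updated from the observed sub-transition, while longer persistences bootstrap on the current estimates at the last observed state. All names, the persistence set, and the greedy-bootstrap choice are simplifying assumptions.

```python
import numpy as np

# Hypothetical toy setting: 5 states, 2 actions, persistence set {1, 2, 4}.
n_states, n_actions = 5, 2
persistences = [1, 2, 4]
gamma, alpha = 0.99, 0.1

# Q-values indexed by (state, action, persistence-index).
Q = np.zeros((n_states, n_actions, len(persistences)))

def greedy_value(s):
    """Bootstrap value: max over actions and persistences at state s."""
    return Q[s].max()

def all_persistence_update(traj, a):
    """traj = [s_0, r_1, s_1, ..., r_k, s_k], obtained by persisting action a
    for k steps; update Q(s_0, a, k') for every k' in the persistence set."""
    states = traj[0::2]
    rewards = traj[1::2]
    k = len(rewards)
    for idx, kp in enumerate(persistences):
        # Usable sub-transition length: the full k' steps if k' <= k,
        # otherwise only the k steps actually observed.
        horizon = min(kp, k)
        ret = sum(gamma**t * rewards[t] for t in range(horizon))
        # Bootstrap at the state reached after `horizon` steps; when kp > k
        # this stands in for the unobserved tail of the kp-persistent move.
        target = ret + gamma**horizon * greedy_value(states[horizon])
        Q[states[0], a, idx] += alpha * (target - Q[states[0], a, idx])
```

In the full method, the handling of persistences longer than the collected one relies on the dedicated bootstrap procedure mentioned above, for which the resulting operator is proven to be a contraction; the greedy bootstrap used here is only a stand-in for illustration.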