In reinforcement learning, it is common to let an agent interact for a fixed amount of time with its environment before resetting it and repeating the process in a series of episodes. The task that the agent has to learn can either be to maximize its performance over (i) that fixed period, or (ii) an indefinite period where time limits are only used during training to diversify experience. In this paper, we provide a formal account for how time limits could effectively be handled in each of the two cases and explain why not doing so can cause state aliasing and invalidation of experience replay, leading to suboptimal policies and training instability. In case (i), we argue that the terminations due to time limits are in fact part of the environment, and thus a notion of the remaining time should be included as part of the agent's input to avoid violation of the Markov property. In case (ii), the time limits are not part of the environment and are only used to facilitate learning. We argue that this insight should be incorporated by bootstrapping from the value of the state at the end of each partial episode. For both cases, we illustrate empirically the significance of our considerations in improving the performance and stability of existing reinforcement learning algorithms, showing state-of-the-art results on several control tasks.
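To make the two proposals concrete, here is a minimal sketch of both ideas in Python; the function names, the Gym-style `done`/timeout flags, and the normalization of the remaining time are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def time_aware_observation(obs, t, time_limit):
    """Case (i): append the normalized remaining time to the observation,
    so terminations caused by the time limit remain Markovian.
    (Normalizing by `time_limit` is an assumption for this sketch.)"""
    remaining = (time_limit - t) / time_limit
    return np.append(obs, remaining)

def one_step_target(reward, next_value, done, timed_out, gamma=0.99):
    """Case (ii): when an episode is cut short by the time limit, bootstrap
    from the value of the final state instead of treating it as terminal."""
    if done and not timed_out:
        return reward                        # true environment termination
    return reward + gamma * next_value       # keep bootstrapping past the time limit
```

A usage note: in case (ii) the replay buffer would store the timeout flag alongside each transition so that the target above can distinguish genuine terminations from terminations imposed only for training convenience.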