In model-based reinforcement learning, an agent can leverage a learned model in several ways to improve its behavior. Two prevalent approaches are decision-time planning and background planning. In this study, we are interested in understanding under what conditions and in which settings one of these two planning styles will perform better than the other in domains that require fast responses. After viewing them through the lens of dynamic programming, we first consider the classical instantiations of these planning styles and provide theoretical results and hypotheses on which one will perform better in the pure planning, planning & learning, and transfer learning settings. We then consider the modern instantiations of these planning styles and provide hypotheses on which one will perform better in the last two of these settings. Lastly, we perform several illustrative experiments to empirically validate both our theoretical results and hypotheses. Overall, our findings suggest that even though decision-time planning does not perform as well as background planning in their classical instantiations, in their modern instantiations it can perform on par with, or better than, background planning in both the planning & learning and transfer learning settings.
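To make the distinction between the two planning styles concrete, the following is a minimal illustrative sketch in Python, not the paper's implementation: a Dyna-style background planner that improves a value table from simulated transitions between interactions, and a decision-time planner that unrolls the learned model only at the moment of action selection. The toy chain MDP and all names here (env_step, background_planning, decision_time_planning, depth, n_updates) are assumptions introduced purely for illustration.

```python
import random
from collections import defaultdict

# Illustrative toy setup (an assumption, not the paper's domains).
N_STATES, ACTIONS, GAMMA, ALPHA = 5, (0, 1), 0.95, 0.1

def env_step(s, a):
    """Toy deterministic chain: action 1 moves right, action 0 moves left;
    reaching the rightmost state yields reward 1 and ends the episode."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == N_STATES - 1), s2 == N_STATES - 1

model = {}               # learned deterministic model: (s, a) -> (s', r)
Q = defaultdict(float)   # action values maintained by background planning

def background_planning(n_updates=500):
    """Dyna-style background planning: between environment interactions,
    apply value-iteration-like updates to Q on transitions drawn from the
    learned model, so the policy is precomputed before acting."""
    if not model:
        return
    for _ in range(n_updates):
        s, a = random.choice(list(model))
        s2, r = model[(s, a)]
        target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def decision_time_planning(s, depth=4):
    """Decision-time planning: when an action is needed, unroll the learned
    model a few steps ahead from the current state and act greedily."""
    def value(s, d):
        if d == 0:
            return 0.0
        returns = [model[(s, a)][1] + GAMMA * value(model[(s, a)][0], d - 1)
                   for a in ACTIONS if (s, a) in model]
        return max(returns) if returns else 0.0
    scores = {a: model[(s, a)][1] + GAMMA * value(model[(s, a)][0], depth)
              for a in ACTIONS if (s, a) in model}
    return max(scores, key=scores.get) if scores else random.choice(ACTIONS)

# Learn a model from random experience, then compare the two styles: the
# background planner amortizes its computation into Q ahead of time, while
# the decision-time planner spends its computation at the moment of acting.
s = 0
for _ in range(200):
    a = random.choice(ACTIONS)
    s2, r, done = env_step(s, a)
    model[(s, a)] = (s2, r)
    s = 0 if done else s2

background_planning()
print("background planning acts with:", max(ACTIONS, key=lambda a: Q[(0, a)]))
print("decision-time planning acts with:", decision_time_planning(0))
```

The contrast mirrors the abstract's framing: background planning pays its computational cost ahead of time by baking model knowledge into a value function, whereas decision-time planning defers that cost to action selection, which matters in domains that require fast responses.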