We propose Learning Off-Policy with Online Planning (LOOP), an efficient reinforcement learning framework that combines the benefits of model-based local trajectory optimization and off-policy algorithms. The agent learns a dynamics model and then uses trajectory optimization with the learned model to select actions. To sidestep the myopia of fixed-horizon trajectory optimization, a value function learned through an off-policy algorithm is attached to the end of the planning horizon. We investigate various instantiations of this framework and demonstrate its benefits in three settings: online reinforcement learning, offline reinforcement learning, and safe learning. We show that this method significantly improves the underlying model-based and model-free algorithms and achieves state-of-the-art performance across these settings.
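To illustrate the core idea of planning with a terminal value function, the sketch below shows a minimal random-shooting model-predictive controller. The interfaces `dynamics_model`, `reward_fn`, and `value_fn`, as well as the random-shooting optimizer itself, are illustrative assumptions rather than the paper's exact components; the actual trajectory optimizer and model ensemble may differ.

```python
import numpy as np

def plan_action(state, dynamics_model, reward_fn, value_fn,
                horizon=10, num_candidates=500, action_dim=2, action_scale=1.0):
    """Select an action by H-step random-shooting trajectory optimization
    with a learned terminal value function (hypothetical interfaces)."""
    # Sample candidate action sequences uniformly in [-action_scale, action_scale].
    actions = np.random.uniform(-action_scale, action_scale,
                                size=(num_candidates, horizon, action_dim))
    returns = np.zeros(num_candidates)
    states = np.repeat(state[None, :], num_candidates, axis=0)
    for t in range(horizon):
        # Accumulate predicted rewards while rolling candidates through the learned model.
        returns += reward_fn(states, actions[:, t])
        states = dynamics_model(states, actions[:, t])
    # The off-policy value function scores the terminal states,
    # correcting the finite planning horizon.
    returns += value_fn(states)
    # Execute only the first action of the best sequence (MPC-style).
    return actions[np.argmax(returns), 0]
```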