Model-based reinforcement learning (MBRL) has recently gained immense interest due to its potential for sample efficiency and its ability to incorporate off-policy data. However, designing stable and efficient MBRL algorithms using rich function approximators has remained challenging. To help expose the practical challenges in MBRL and simplify algorithm design from the lens of abstraction, we develop a new framework that casts MBRL as a game between: (1) a policy player, which attempts to maximize rewards under the learned model; and (2) a model player, which attempts to fit the real-world data collected by the policy player. For algorithm development, we construct a Stackelberg game between the two players and show that it can be solved with approximate bi-level optimization. This gives rise to two natural families of MBRL algorithms, depending on which player is chosen as the leader in the Stackelberg game. Together, they encapsulate, unify, and generalize many previous MBRL algorithms. Furthermore, our framework is consistent with, and provides a clear basis for, heuristics known to be important in practice from prior works. Finally, through experiments we validate that our proposed algorithms are highly sample efficient, match the asymptotic performance of model-free policy gradient methods, and scale gracefully to high-dimensional tasks like dexterous hand manipulation. Additional details and code can be obtained from the project page at https://sites.google.com/view/mbrl-game
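To make the Stackelberg view concrete, the sketch below (Python, not the authors' released code) illustrates one plausible iteration of each of the two algorithm families, distinguished by which player leads: a policy-as-leader loop and a model-as-leader loop. Every routine passed in is a hypothetical placeholder for the reader's own environment interface, model class, and policy optimizer, so this is only a minimal sketch of the approximate bi-level structure described above.

```python
# Minimal sketch of the two Stackelberg algorithm families, assuming
# user-supplied routines for data collection, model fitting, and policy
# optimization. All callables here are illustrative placeholders.

def pal_iteration(policy, model, collect_data, fit_model, conservative_policy_step):
    """Policy-as-leader: the follower (model) is fit aggressively to data
    from the current policy; the leader (policy) then takes a small,
    conservative improvement step against that model."""
    data = collect_data(policy)                       # real-world rollouts with current policy
    model = fit_model(model, data)                    # follower solved (near) optimally
    policy = conservative_policy_step(policy, model)  # leader takes a conservative step
    return policy, model

def mal_iteration(policy, model, buffer, collect_data, optimize_policy,
                  conservative_model_step):
    """Model-as-leader: the follower (policy) is optimized aggressively under
    the current model; the leader (model) then takes a small, conservative
    step toward fitting all real-world data gathered so far."""
    policy = optimize_policy(policy, model)           # follower solved (near) optimally
    buffer = buffer + collect_data(policy)            # aggregate real-world rollouts
    model = conservative_model_step(model, buffer)    # leader takes a conservative step
    return policy, model, buffer
```

In both loops the follower is (approximately) solved to optimality while the leader moves conservatively, which is one simple way to realize the approximate bi-level optimization the abstract refers to.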