Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model. Although the dynamics model is learned by reusing the static dataset, its generalization ability can promote policy learning if properly utilized. To that end, several works propose to quantify the uncertainty of predicted dynamics and explicitly apply it to penalize the reward. However, as the dynamics and the reward are intrinsically different factors in the context of an MDP, characterizing the impact of dynamics uncertainty through a reward penalty may incur an unexpected tradeoff between model utilization and risk avoidance. In this work, we instead maintain a belief distribution over the dynamics and evaluate/optimize the policy through biased sampling from the belief. The sampling procedure, biased towards pessimism, is derived from an alternating Markov game formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. To improve the policy, we devise an iterative regularized policy optimization algorithm for the game, with a guarantee of monotonic improvement under certain conditions. To make it practical, we further devise an offline RL algorithm to approximately find the solution. Empirical results show that the proposed approach achieves state-of-the-art performance on a wide range of benchmark tasks.
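The following is a minimal, illustrative sketch (not the authors' implementation) of what pessimism-biased sampling from a dynamics belief could look like. It assumes the belief is approximated by an ensemble of learned dynamics models, each exposing a hypothetical `sample(state, action)` method that returns a predicted next state and reward, and that `value_fn` is the current value estimate; drawing several candidate transitions and keeping one with a low-ranked value biases rollouts towards pessimistic outcomes.

```python
# Hedged sketch of pessimism-biased sampling from a dynamics belief.
# `dynamics_ensemble`, `model.sample`, and `value_fn` are hypothetical
# placeholders standing in for a learned belief over dynamics and a critic.
import random


def pessimistic_sample(state, action, dynamics_ensemble, value_fn, N=10, k=2):
    """Draw N candidate transitions from the belief and return the one whose
    predicted value ranks k-th smallest (smaller k = more pessimism)."""
    candidates = []
    for _ in range(N):
        # Draw one dynamics model from the (ensemble-approximated) belief.
        model = random.choice(dynamics_ensemble)
        next_state, reward = model.sample(state, action)
        candidates.append((value_fn(next_state), next_state, reward))
    # Sort candidates by predicted value, ascending, and keep the k-th smallest.
    candidates.sort(key=lambda c: c[0])
    _, next_state, reward = candidates[k - 1]
    return next_state, reward
```

Under this kind of scheme, the choice of (N, k) modulates how strongly the effective (reweighted) dynamics belief is tilted towards low-value outcomes for the current policy.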