Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks.
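For intuition only, below is a minimal NumPy sketch of the general idea the abstract describes: a contrastive critic scores (state, action, future state) triples so that futures actually reached from a state-action pair score higher than futures drawn from other transitions, and those scores can then weight rewards to estimate action values without TD backups. The bilinear critic, array shapes, and variable names are illustrative assumptions, not the paper's architecture.

```python
# Sketch of an implicit multi-step model trained with an InfoNCE-style objective.
# Everything here (toy data, linear encoders, bilinear critic) is an assumption
# for illustration; it is not the CVL implementation.
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, embed_dim, batch = 4, 2, 8, 32

# Toy offline batch: (s, a), a future state from the same trajectory, and its reward.
s = rng.normal(size=(batch, state_dim))
a = rng.normal(size=(batch, action_dim))
s_future = rng.normal(size=(batch, state_dim))
r_future = rng.normal(size=(batch,))

# Linear encoders phi(s, a) and psi(s_future); the critic is their inner product.
W_sa = rng.normal(size=(state_dim + action_dim, embed_dim)) * 0.1
W_f = rng.normal(size=(state_dim, embed_dim)) * 0.1

def critic(s, a, s_f):
    phi = np.concatenate([s, a], axis=-1) @ W_sa   # (batch, embed_dim)
    psi = s_f @ W_f                                # (batch, embed_dim)
    return phi @ psi.T                             # (batch, batch) pairwise scores

# InfoNCE-style loss: diagonal pairs (true futures) are positives,
# off-diagonal pairs (futures of other transitions) are negatives.
logits = critic(s, a, s_future)
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
nce_loss = -np.mean(np.diag(log_probs))

# Once trained, the critic's scores can weight rewards at candidate future states,
# yielding a value estimate for each (s, a) without any TD learning (illustrative).
weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
q_estimate = weights @ r_future                    # (batch,)
print(f"NCE loss: {nce_loss:.3f}, example Q estimate: {q_estimate[0]:.3f}")
```

Because the critic only outputs a scalar compatibility score rather than a predicted observation, this kind of implicit model sidesteps high-dimensional reconstruction, which is the scaling property the abstract highlights.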