Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state-conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance when fine-tuned with online interaction after offline initialization.
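To make the procedure concrete, the sketch below writes out one set of training objectives consistent with the description above; the symbols $\tau$ (the expectile), $\beta$ (an inverse temperature), and the target parameters $\hat\theta$ are notational choices introduced here for illustration rather than taken from the text itself.

With the asymmetric squared loss $L_2^\tau(u) = |\tau - \mathbb{1}(u < 0)|\,u^2$ for $\tau \in (0.5, 1)$, the value function is fit to an upper expectile of the Q-values of dataset actions,
\[
L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[ L_2^\tau\!\left( Q_{\hat\theta}(s,a) - V_\psi(s) \right) \right],
\]
the Q-function is backed up through this value function, so no out-of-dataset action is ever evaluated,
\[
L_Q(\theta) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\!\left[ \left( r + \gamma V_\psi(s') - Q_\theta(s,a) \right)^2 \right],
\]
and the policy is extracted with advantage-weighted behavioral cloning,
\[
L_\pi(\phi) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[ \exp\!\left( \beta\left( Q_{\hat\theta}(s,a) - V_\psi(s) \right) \right) \log \pi_\phi(a \mid s) \right].
\]
As $\tau \to 1$, the expectile approaches a maximum over the actions supported by the data, which is what lets the learned policy improve over the behavior policy without ever querying unseen actions.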