Reinforcement Learning (RL) has been shown to be effective in domains where the agent can learn policies by actively interacting with its operating environment. However, when the RL scheme is changed to an offline setting, where the agent can only update its policy from static datasets, a major issue of offline reinforcement learning emerges, namely distributional shift. We propose a Pessimistic Offline Reinforcement Learning (PessORL) algorithm that actively leads the agent back to regions it is familiar with by manipulating the value function. We focus on problems caused by out-of-distribution (OOD) states and deliberately penalize high values at states that are absent from the training dataset, so that the learned pessimistic value function lower-bounds the true value everywhere in the state space. We evaluate the PessORL algorithm on various benchmark tasks and show that our method achieves better performance by explicitly handling OOD states, compared to methods that only consider OOD actions.
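To make the idea of penalizing values at OOD states concrete, below is a minimal sketch, not the paper's exact formulation: it assumes a hypothetical `ood_scores` signal (e.g., from a density or uncertainty model over states) and a penalty weight `alpha`, and simply lowers the value estimates of states flagged as out-of-distribution so the pessimistic values stay below the unpenalized ones there.

```python
# Minimal illustrative sketch (assumed names: ood_scores, alpha), not PessORL itself:
# push down value estimates at states an OOD detector marks as far from the dataset.
import torch

def pessimistic_values(values: torch.Tensor, ood_scores: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """values:     (batch,) value estimates V(s) from a critic network
    ood_scores: (batch,) hypothetical OOD measure in [0, 1]; 0 = in-distribution
    alpha:      penalty weight; larger alpha means more pessimism at OOD states"""
    return values - alpha * ood_scores

# Usage: states with high OOD scores have their values lowered before policy improvement.
v = torch.tensor([1.0, 2.0, 3.0])
ood = torch.tensor([0.0, 0.5, 1.0])
print(pessimistic_values(v, ood, alpha=2.0))  # tensor([1., 1., 1.])
```

The design intent mirrored here is that a policy maximizing these penalized values is discouraged from visiting unfamiliar states, steering the agent back toward regions covered by the training data.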