Offline reinforcement learning (RL) promises the ability to learn effective policies solely from existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized against such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. This drawback can be alleviated if we instead learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose among them during evaluation. In this work, we therefore propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation, adjusting the confidence level based on the history of observations seen so far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence level. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.
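To make the idea of conditioning a conservative Q-function on a confidence level concrete, below is a minimal, hypothetical tabular sketch. It is not the paper's exact backup operator: the count-based pessimism term, the penalty schedule `alpha(delta)`, and all function and variable names are illustrative assumptions chosen to show how one Q-table per confidence level could be trained with a confidence-dependent conservatism penalty.

```python
import numpy as np

# Hypothetical sketch: one Q-table per confidence level delta, updated with a
# CQL-style pessimism penalty whose strength grows with delta. This is an
# assumed instantiation for illustration, not the paper's exact update rule.

def confidence_conditioned_backup(Q, dataset, confidences, behavior_counts,
                                  gamma=0.99, lr=0.1, base_alpha=1.0):
    """One sweep of fitted Q-iteration over an offline dataset.

    Q: dict mapping confidence level delta -> array [num_states, num_actions]
    dataset: list of (s, a, r, s_next, done) transitions
    confidences: confidence levels delta in (0, 1)
    behavior_counts: empirical state-action counts from the dataset, used here
        as an assumed proxy for how out-of-distribution an action is.
    """
    for delta in confidences:
        q = Q[delta]
        # Higher confidence -> stronger pessimism (hypothetical schedule).
        alpha = base_alpha * np.log(1.0 / (1.0 - delta))
        for (s, a, r, s_next, done) in dataset:
            # Bootstrap target at the same confidence level.
            target = r + (0.0 if done else gamma * q[s_next].max())
            # Penalize actions rarely supported by the dataset.
            penalty = alpha / np.sqrt(1.0 + behavior_counts[s, a])
            q[s, a] += lr * (target - penalty - q[s, a])
    return Q
```

At evaluation time, the abstract's adaptive strategy would amount to selecting which `Q[delta]` to act under, updating the choice of `delta` as the observed returns accumulate; the sketch above only covers the training-time backup.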