Offline reinforcement learning (offline RL), which aims to find an optimal policy from a previously collected static dataset, bears algorithmic difficulties due to function approximation errors from out-of-distribution (OOD) data points. To this end, offline RL algorithms adopt either a constraint or a penalty term that explicitly guides the policy to stay close to the given dataset. However, prior methods typically require accurate estimation of the behavior policy or sampling from OOD data points, either of which can itself be a non-trivial problem. Moreover, these methods under-utilize the generalization ability of deep neural networks and often fall into suboptimal solutions too close to the given dataset. In this work, we propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution. We show that clipped Q-learning, a technique widely used in online RL, can be leveraged to successfully penalize OOD data points with high prediction uncertainties. Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks by simply increasing the number of Q-networks along with clipped Q-learning. Based on this observation, we propose an ensemble-diversified actor-critic algorithm that reduces the number of required ensemble networks to a tenth of the naive ensemble while achieving state-of-the-art performance on most of the D4RL benchmarks considered.
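To make the mechanism concrete, below is a minimal sketch (not the authors' released implementation) of the clipped Q-learning target computed over an ensemble of N Q-networks: taking the minimum prediction across the ensemble implicitly penalizes actions whose Q-estimates disagree, i.e. OOD actions with high prediction uncertainty. The QNetwork class, the clipped_q_target helper, and the gamma/done handling are illustrative assumptions, and the SAC entropy term is omitted for brevity.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """One member of the Q-ensemble: maps (state, action) to a scalar Q-value."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def clipped_q_target(q_ensemble, reward, next_state, next_action, done, gamma=0.99):
    """Bellman target using the minimum over the N Q-networks (clipped Q-learning)."""
    with torch.no_grad():
        # Stack predictions into shape (N, batch), then take the element-wise
        # minimum over the ensemble dimension; disagreement among the networks
        # pushes this minimum down, penalizing uncertain (OOD) actions.
        next_qs = torch.stack(
            [q(next_state, next_action) for q in q_ensemble], dim=0
        )
        min_next_q = next_qs.min(dim=0).values
        return reward + gamma * (1.0 - done) * min_next_q
```

In this sketch, increasing the number of networks in q_ensemble corresponds to the "simply increasing the number of Q-networks" observation above; the ensemble-diversified variant additionally trains the Q-networks to disagree more on OOD inputs so that far fewer networks are needed.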