The learned policy of model-free offline reinforcement learning (RL) methods is often constrained to stay within the support of the dataset to avoid potentially dangerous out-of-distribution actions or states, making it challenging to handle out-of-support regions. Model-based RL methods offer a richer dataset and improve generalization by generating imaginary trajectories with either a trained forward or reverse dynamics model. However, the imagined transitions may be inaccurate, thus degrading the performance of the underlying offline RL method. In this paper, we propose to augment the offline dataset using trained bidirectional dynamics models and rollout policies with a double-check mechanism. We introduce conservatism by trusting only the samples on which the forward model and the backward model agree. Our method, confidence-aware bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method. Experimental results on the D4RL benchmarks demonstrate that our method significantly boosts the performance of existing model-free offline RL algorithms and achieves competitive or better scores than baseline methods.
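To make the agreement-based "double check" concrete, the sketch below illustrates one way such a filter could look: an imagined transition is kept only when the forward and backward dynamics models are mutually consistent. This is a minimal illustration under assumed interfaces; the names (`forward_model`, `backward_model`, `predict`, `threshold`) are hypothetical and do not come from the paper's released code.

```python
import numpy as np

def double_check_filter(s, a, forward_model, backward_model, threshold):
    """Keep an imagined transition only if the forward and backward
    dynamics models agree on it (used here as a proxy for confidence).

    s, a: batches of states and actions drawn from the offline dataset.
    forward_model / backward_model: assumed to expose a predict() method.
    """
    # Forward model imagines the next state from (s, a).
    s_next_fwd = forward_model.predict(s, a)
    # Backward model reconstructs the preceding state from (s_next, a).
    s_prev_bwd = backward_model.predict(s_next_fwd, a)
    # Disagreement: distance between the reconstruction and the true state.
    disagreement = np.linalg.norm(s_prev_bwd - s, axis=-1)
    # Trust the imagined sample only when the two models agree closely.
    keep = disagreement < threshold
    return s_next_fwd, keep
```

In this sketch, only the transitions flagged by `keep` would be added to the augmented dataset before running the chosen model-free offline RL algorithm; the threshold controls how conservative the augmentation is.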