Offline reinforcement learning (RL), also known as batch RL, offers the prospect of policy optimization from large pre-recorded datasets without online environment interaction. It addresses challenges with regard to the cost of data collection and safety, both of which are particularly pertinent to real-world applications of RL. Unfortunately, most off-policy algorithms perform poorly when learning from a fixed dataset. In this paper, we propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR). We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces -- outperforming several state-of-the-art offline RL algorithms by a significant margin on a wide range of benchmark tasks.
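To make the idea of critic-regularized regression concrete, below is a minimal sketch of a CRR-style policy update: behaviour cloning on dataset actions, re-weighted by an exponentiated advantage estimate from a learned critic. This is an illustrative sketch, not the authors' implementation; the network definitions and the names `PolicyNet`, `CriticNet`, `beta`, and `num_action_samples` are assumptions introduced here for clarity.

```python
# Minimal CRR-style policy loss sketch (PyTorch). Assumes a Gaussian policy and a
# learned critic Q(s, a); all class/function names here are illustrative.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Gaussian policy: outputs mean and log-std per action dimension."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 2 * act_dim))

    def dist(self, obs):
        mean, log_std = self.body(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

class CriticNet(nn.Module):
    """State-action value function Q(s, a)."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))

    def forward(self, obs, act):
        return self.q(torch.cat([obs, act], dim=-1)).squeeze(-1)

def crr_policy_loss(policy, critic, obs, act, beta=1.0, num_action_samples=4):
    """Critic-regularized regression: weighted log-likelihood of dataset actions,
    with weights given by exp(advantage / beta) estimated from the critic."""
    dist = policy.dist(obs)
    log_prob = dist.log_prob(act).sum(-1)          # log pi(a|s) for dataset actions
    with torch.no_grad():
        q_data = critic(obs, act)                  # Q(s, a) for dataset actions
        # Monte-Carlo estimate of V(s) = E_{a'~pi}[Q(s, a')]
        sampled = dist.sample((num_action_samples,))
        v = torch.stack([critic(obs, a) for a in sampled]).mean(0)
        adv = q_data - v
        weight = torch.clamp(torch.exp(adv / beta), max=20.0)  # clipped exp weights
    return -(weight * log_prob).mean()

# Usage on a random batch (stand-in for a batch sampled from the offline dataset):
obs_dim, act_dim = 17, 6
policy, critic = PolicyNet(obs_dim, act_dim), CriticNet(obs_dim, act_dim)
obs, act = torch.randn(32, obs_dim), torch.randn(32, act_dim)
loss = crr_policy_loss(policy, critic, obs, act)
loss.backward()
print(float(loss))
```

The weighting function shown here corresponds to the exponential-advantage variant; an indicator on positive advantage is another natural choice, and the critic itself is trained separately (e.g., by temporal-difference learning on the fixed dataset).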