Offline reinforcement learning aims to train a policy on a pre-recorded, fixed dataset without any additional environment interactions. There are two major challenges in this setting: (1) extrapolation error caused by approximating the values of state-action pairs not well covered by the training data and (2) distributional shift between the behavior and inference policies. One way to tackle these problems is to induce conservatism, i.e., to keep the learned policies close to the behavioral ones. To achieve this, we build upon recent work on learning policies in latent action spaces and use a special form of Normalizing Flows to construct a generative model that serves as a conservative action encoder. This Normalizing Flows action encoder is pre-trained in a supervised manner on the offline dataset, after which an additional policy model, a controller in the latent space, is trained via reinforcement learning. This approach avoids querying actions outside of the training dataset and therefore does not require additional regularization for out-of-dataset actions. We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms with generative action models on a large portion of the datasets.
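For concreteness, the sketch below illustrates the two-stage recipe summarized above; it is not the authors' implementation. It assumes PyTorch and a RealNVP-style conditional coupling flow, with illustrative class names (AffineCoupling, ConditionalFlow) and dimensions (a 17-dimensional state space and 6-dimensional action space). Stage one fits the flow to offline (state, action) pairs by maximum likelihood; stage two trains a latent-space controller whose bounded outputs are decoded into environment actions through the frozen flow.

```python
# Minimal sketch of the two-stage recipe; names, sizes, and hyperparameters are illustrative.
import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """One state-conditioned affine coupling layer (RealNVP-style)."""

    def __init__(self, act_dim, state_dim, hidden=256):
        super().__init__()
        self.half = act_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (act_dim - self.half)),
        )

    def forward(self, z, s):
        # Latent -> action direction: transform the second half given the first half and state.
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_scale, shift = self.net(torch.cat([z1, s], dim=-1)).chunk(2, dim=-1)
        log_scale = torch.tanh(log_scale)  # bounded scales for numerical stability
        return torch.cat([z1, z2 * log_scale.exp() + shift], dim=-1)

    def inverse(self, a, s):
        # Action -> latent direction; also returns log|det dz/da| for the likelihood.
        a1, a2 = a[:, :self.half], a[:, self.half:]
        log_scale, shift = self.net(torch.cat([a1, s], dim=-1)).chunk(2, dim=-1)
        log_scale = torch.tanh(log_scale)
        return torch.cat([a1, (a2 - shift) * (-log_scale).exp()], dim=-1), -log_scale.sum(-1)


class ConditionalFlow(nn.Module):
    """Stack of coupling layers acting as the conservative action encoder/decoder."""

    def __init__(self, act_dim, state_dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(act_dim, state_dim) for _ in range(n_layers)])
        self.prior = torch.distributions.Normal(0.0, 1.0)

    def decode(self, z, s):
        # Map a latent action to an environment action.
        for layer in self.layers:
            z = layer(z, s).flip(-1)  # flipping mixes dimensions between couplings
        return z

    def log_prob(self, a, s):
        # Exact log p(a | s) via the change-of-variables formula.
        total = torch.zeros(a.shape[0])
        for layer in reversed(self.layers):
            a, log_det = layer.inverse(a.flip(-1), s)  # undo flips in reverse order
            total = total + log_det
        return self.prior.log_prob(a).sum(-1) + total


# Dummy offline batch standing in for a real dataset of transitions.
states, actions = torch.randn(256, 17), torch.rand(256, 6) * 2 - 1

# Stage 1: supervised pre-training of the flow by maximum likelihood on the dataset.
flow = ConditionalFlow(act_dim=6, state_dim=17)
opt = torch.optim.Adam(flow.parameters(), lr=1e-4)
for _ in range(10):
    loss = -flow.log_prob(actions, states).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: freeze the flow; an RL algorithm trains a latent-space controller whose
# bounded outputs are decoded through the flow, so queried actions stay in-support.
for p in flow.parameters():
    p.requires_grad_(False)
latent_policy = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6), nn.Tanh())
env_actions = flow.decode(latent_policy(states), states)  # passed to the critic / environment
```

In this sketch the latent action space has the same dimensionality as the environment action space, since a Normalizing Flow is bijective; the conservatism comes from decoding every latent action through a model fitted only to dataset actions.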