Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on the pre-training of state representations, followed by policy training. In this work, we introduce a simple, yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm. Theoretically, we prove that BPR enjoys performance guarantees when integrated into algorithms that either have policy improvement guarantees (conservative algorithms) or produce lower bounds of the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks. The code is available at \url{https://github.com/bit1029public/offline_bpr}.
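The following is a minimal sketch of the two-stage recipe described above, assuming continuous actions and a PyTorch-style setup; the module names (\texttt{Encoder}, \texttt{BCHead}, \texttt{pretrain\_representation}), network sizes, and training loop are illustrative assumptions, not the authors' implementation (see the linked repository for that).
\begin{verbatim}
# Sketch of BPR stage 1: behavior-cloning pre-training of the encoder.
# Assumes a dataset_loader yielding (states, actions) tensor batches.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps raw states s to a compact representation phi(s)."""
    def __init__(self, state_dim, repr_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, repr_dim),
        )

    def forward(self, s):
        return self.net(s)

class BCHead(nn.Module):
    """Predicts the dataset action from phi(s) (behavior cloning)."""
    def __init__(self, repr_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(repr_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, z):
        return self.net(z)

def pretrain_representation(encoder, bc_head, dataset_loader,
                            epochs=10, lr=3e-4):
    """Stage 1: learn phi by mimicking actions in the offline dataset."""
    params = list(encoder.parameters()) + list(bc_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for states, actions in dataset_loader:
            pred = bc_head(encoder(states))
            loss = nn.functional.mse_loss(pred, actions)
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Freeze the representation before the policy-training stage.
    for p in encoder.parameters():
        p.requires_grad_(False)
    return encoder
\end{verbatim}
In the second stage, any off-the-shelf Offline RL algorithm is trained on the frozen representation, i.e. on \texttt{encoder(s)} in place of the raw state \texttt{s}.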