Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on pre-training state representations, followed by policy training. In this work, we introduce a simple yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf offline RL algorithm. Theoretically, we prove that BPR provides performance guarantees when integrated into algorithms that have either policy improvement guarantees (conservative algorithms) or produce lower bounds on the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art offline RL algorithms leads to significant improvements across several offline control benchmarks.
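To make the two-stage recipe concrete, the following is a minimal sketch, assuming a PyTorch setup and a continuous-action dataset of (state, action) pairs; the `Encoder`, `bc_head`, and `pretrain_bpr` names are illustrative placeholders rather than the paper's implementation.

```python
# Minimal sketch of the BPR recipe: stage 1 learns a state encoder via
# behavior cloning on the offline dataset; stage 2 freezes the encoder and
# trains a policy on top of it with any off-the-shelf offline RL algorithm.
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """State encoder phi(s), learned in stage 1 and frozen for stage 2."""
    def __init__(self, state_dim: int, rep_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, rep_dim), nn.ReLU(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def pretrain_bpr(encoder, bc_head, dataset, epochs=10, lr=3e-4):
    """Stage 1: fit phi by regressing dataset actions from encoded states
    (a mean-squared-error behavior-cloning variant for continuous control)."""
    params = list(encoder.parameters()) + list(bc_head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for states, actions in dataset:       # batches of (s, a) from the fixed dataset
            pred = bc_head(encoder(states))   # predicted action from phi(s)
            loss = ((pred - actions) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    # Freeze the representation before downstream policy learning.
    for p in encoder.parameters():
        p.requires_grad_(False)
    return encoder

# Stage 2 (not shown): feed encoder(s) instead of the raw state into any
# off-the-shelf offline RL algorithm and train the policy on the fixed
# representation.
```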