In reinforcement learning (RL), it is easier to solve a task if given a good representation. While deep RL should automatically acquire such good representations, prior work often finds that learning representations in an end-to-end fashion is unstable and instead equips RL algorithms with additional representation learning components (e.g., auxiliary losses, data augmentation). How can we design RL algorithms that directly acquire good representations? In this paper, instead of adding representation learning components to an existing RL algorithm, we show that (contrastive) representation learning methods can be cast as RL algorithms in their own right. To do this, we build upon prior work and apply contrastive representation learning to action-labeled trajectories, in such a way that the (inner product of) learned representations exactly corresponds to a goal-conditioned value function. We use this idea to reinterpret a prior RL method as performing contrastive learning, and then use the idea to propose a much simpler method that achieves similar performance. Across a range of goal-conditioned RL tasks, we demonstrate that contrastive RL methods achieve higher success rates than prior non-contrastive methods, including in the offline RL setting. We also show that contrastive RL outperforms prior methods on image-based tasks, without using data augmentation or auxiliary objectives.
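To make the core idea concrete, below is a minimal sketch (not the paper's reference implementation) of the kind of contrastive critic described above: state-action pairs and goals are embedded separately, their inner products form the critic values, and future states from the same trajectory act as positives while states drawn from other trajectories act as negatives in a binary-NCE objective. The names (`contrastive_critic_loss`, `phi_sa`, `psi_g`) are illustrative, and details such as how positives are sampled from the discounted future-state distribution are omitted.

```python
import torch
import torch.nn.functional as F

def contrastive_critic_loss(phi_sa: torch.Tensor, psi_g: torch.Tensor) -> torch.Tensor:
    """Binary-NCE loss for a contrastive critic f(s, a, g) = <phi(s, a), psi(g)>.

    phi_sa: (B, d) embeddings of state-action pairs.
    psi_g:  (B, d) embeddings of goals, where psi_g[i] embeds a future state
            sampled from the same trajectory as phi_sa[i] (the positive pair),
            and psi_g[j] for j != i serve as negatives.
    """
    logits = phi_sa @ psi_g.T                                   # (B, B) critic values f(s, a, g)
    labels = torch.eye(logits.shape[0], device=logits.device)   # 1 on the diagonal (positive pairs)
    return F.binary_cross_entropy_with_logits(logits, labels)
```

As the abstract states, the inner product of the learned representations corresponds to a goal-conditioned value function, so one natural way to use such a critic is to train a goal-conditioned policy to maximize it for the commanded goal, in the usual actor-critic fashion.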