Offline reinforcement learning proposes to learn policies from large, previously collected datasets without interacting with the physical environment. These algorithms make it possible to learn useful skills from data, which can then be deployed in real-world settings where interactions may be costly or dangerous, such as autonomous driving or factories. However, current algorithms overfit to the dataset they are trained on and exhibit poor out-of-distribution generalization when deployed in the environment. In this paper, we study the effectiveness of data augmentations on the state space, examining 7 different augmentation schemes and how they behave with existing offline RL algorithms. We then combine the best-performing data augmentation scheme with a state-of-the-art Q-learning technique, improving the function approximation of the Q-networks by smoothing out the learned state-action space. We experimentally show that using this Surprisingly Simple Self-Supervision technique in RL (S4RL), we significantly improve over current state-of-the-art algorithms on offline robot learning environments such as MetaWorld [1] and RoboSuite [2,3], and on benchmark datasets such as D4RL [4].
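To make the idea concrete, below is a minimal sketch (not the authors' implementation) of one possible state-space augmentation and how it can be folded into a Q-learning target: the next state is perturbed with zero-mean Gaussian noise and the bootstrapped target is averaged over several augmented copies, which smooths the learned Q-function. The network architecture, the noise scale `sigma`, and the number of augmentations `num_aug` are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch (assumed names and hyperparameters): additive Gaussian
# state augmentation with a Q-target averaged over augmented next states.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Simple MLP Q-network over continuous states and actions (hypothetical)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


def augment_state(state: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """One of several possible schemes: zero-mean Gaussian noise on the state."""
    return state + sigma * torch.randn_like(state)


def smoothed_q_target(target_q, policy, reward, next_state, done,
                      gamma: float = 0.99, num_aug: int = 4, sigma: float = 0.1):
    """Average the bootstrapped Q-target over `num_aug` augmented next states."""
    targets = []
    with torch.no_grad():
        for _ in range(num_aug):
            aug_next = augment_state(next_state, sigma)
            next_action = policy(aug_next)
            q_next = target_q(aug_next, next_action)
            targets.append(reward + gamma * (1.0 - done) * q_next)
    return torch.stack(targets, dim=0).mean(dim=0)
```

In use, `smoothed_q_target` would replace the usual single-sample bootstrapped target inside an existing offline Q-learning update, leaving the rest of the algorithm unchanged.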