While agents trained by Reinforcement Learning (RL) can solve increasingly challenging tasks directly from visual observations, generalizing learned skills to novel environments remains difficult. Extensive use of data augmentation is a promising technique for improving generalization in RL, but it is often found to decrease sample efficiency and can even lead to divergence. In this paper, we investigate causes of instability when using data augmentation in common off-policy RL algorithms. We identify two problems, both rooted in high-variance Q-targets. Based on our findings, we propose a simple yet effective technique for stabilizing this class of algorithms under augmentation. We perform an extensive empirical evaluation of image-based RL using both ConvNets and Vision Transformers (ViT) on a family of benchmarks based on DeepMind Control Suite, as well as in robotic manipulation tasks. Our method greatly improves stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL in environments with unseen visuals. We further show that our method scales to RL with ViT-based architectures, and that data augmentation may be especially important in this setting.
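To make the source of instability concrete, the sketch below is a minimal, hypothetical illustration (not the paper's proposed method) of where image augmentation enters the bootstrap target in a generic off-policy critic update: when the target network is evaluated on a freshly augmented view of the next observation, the Q-target itself becomes a random variable, and this augmentation-induced noise is the kind of extra Q-target variance referred to above. All names (`QNetwork`, `random_shift`, `td_targets`) are assumptions for illustration only.

```python
# Illustrative sketch only: a generic off-policy TD target with random-shift
# augmentation applied inside the target computation. Hypothetical names; this
# is NOT the stabilization technique proposed in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def random_shift(obs: torch.Tensor, pad: int = 4) -> torch.Tensor:
    """Random-shift augmentation: replicate-pad, then randomly crop back.

    obs: (B, C, H, W) image batch. Each call yields a different crop, so using
    it inside the target computation makes the bootstrap target stochastic.
    """
    b, c, h, w = obs.shape
    padded = F.pad(obs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(obs)
    for i in range(b):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out


class QNetwork(nn.Module):
    """Tiny ConvNet critic Q(s, a), for illustration only."""

    def __init__(self, action_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        # LazyLinear infers its input size (image features + action) on first call.
        self.head = nn.LazyLinear(1)

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        z = self.encoder(obs)
        return self.head(torch.cat([z, act], dim=-1))


def td_targets(q_target: QNetwork, next_obs, next_act, reward, done,
               gamma: float = 0.99, augment: bool = True) -> torch.Tensor:
    """Bootstrap target r + gamma * (1 - done) * Q_target(aug(s'), a').

    With augment=True, two calls on identical inputs generally return
    different targets because each call draws a new random shift.
    """
    with torch.no_grad():
        obs_in = random_shift(next_obs) if augment else next_obs
        return reward + gamma * (1.0 - done) * q_target(obs_in, next_act)


if __name__ == "__main__":
    # Repeated target computations on the same batch differ under augmentation.
    next_obs = torch.rand(8, 3, 84, 84)
    next_act = torch.rand(8, 4)
    reward = torch.rand(8, 1)
    done = torch.zeros(8, 1)
    q_tgt = QNetwork(action_dim=4)
    t1 = td_targets(q_tgt, next_obs, next_act, reward, done)
    t2 = td_targets(q_tgt, next_obs, next_act, reward, done)
    print((t1 - t2).abs().mean())  # nonzero: augmentation noise enters the target
```

Note that this only localizes where augmentation noise can reach the Q-target; the paper's actual stabilization technique is described in the main text, not here.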