Extensive efforts have been made to improve the generalization ability of Reinforcement Learning (RL) methods via domain randomization and data augmentation. However, as more factors of variation are introduced during training, the optimization process becomes increasingly difficult, leading to low sample efficiency and unstable training. Instead of learning policies directly from augmented data, we propose SOft Data Augmentation (SODA), a method that decouples augmentation from policy learning. Specifically, SODA imposes a soft constraint on the encoder that aims to maximize the mutual information between latent representations of augmented and non-augmented data, while the RL optimization process uses strictly non-augmented data. Empirical evaluations are performed on diverse tasks from the DeepMind Control Suite as well as a robotic manipulation task, and we find SODA to significantly advance sample efficiency, generalization, and stability in training over state-of-the-art vision-based RL methods.
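To make the decoupling concrete, below is a minimal sketch of a SODA-style auxiliary objective, assuming a BYOL-like setup in which an online branch encodes augmented observations and a slowly-updated target branch encodes the original, non-augmented observations. The names (`Projector`, `soda_loss`, `ema_update`, `augment`, `feature` dimensions) are illustrative rather than the authors' exact API, and the RL loss itself is assumed to be computed elsewhere on non-augmented data only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Projector(nn.Module):
    """Small MLP that maps encoder features to the space where the soft constraint is applied."""

    def __init__(self, in_dim: int, hidden_dim: int = 256, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)


def soda_loss(encoder, projector, target_encoder, target_projector, obs, augment):
    """Soft-constraint loss aligning latents of augmented and non-augmented views.

    Only the online branch (encoder + projector) receives gradients; the target
    branch is a frozen copy updated by EMA, so the RL objective never sees
    augmented observations.
    """
    aug_obs = augment(obs)                              # e.g. random overlay / cropping (assumed)
    z_aug = projector(encoder(aug_obs))                 # online branch: augmented view
    with torch.no_grad():                               # target branch: clean view, no gradients
        z_clean = target_projector(target_encoder(obs))
    z_aug = F.normalize(z_aug, dim=-1)
    z_clean = F.normalize(z_clean, dim=-1)
    return F.mse_loss(z_aug, z_clean)                   # equals negative cosine similarity up to constants


@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, tau: float = 0.005):
    """Polyak-average online parameters into the target network."""
    for p, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.lerp_(p.data, tau)
```

In a training loop, `soda_loss` would be minimized alongside the usual actor-critic losses (which consume only non-augmented observations), followed by a call to `ema_update` for the target encoder and projector; this is intended as a sketch of the idea, not the definitive implementation.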