Extensive efforts have been made to improve the generalization ability of Reinforcement Learning (RL) methods via domain randomization and data augmentation. However, as more factors of variation are introduced during training, optimization becomes increasingly challenging and may empirically result in lower sample efficiency and unstable training. Instead of learning policies directly from augmented data, we propose SOft Data Augmentation (SODA), a method that decouples augmentation from policy learning. Specifically, SODA imposes a soft constraint on the encoder that aims to maximize the mutual information between latent representations of augmented and non-augmented data, while the RL optimization process uses strictly non-augmented data. Empirical evaluations are performed on diverse tasks from the DeepMind Control Suite as well as a robotic manipulation task, and we find that SODA significantly improves sample efficiency, generalization, and training stability over state-of-the-art vision-based RL methods.
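To make the mechanism concrete, below is a minimal sketch of a SODA-style auxiliary update, assuming a projection/prediction head with a cosine-similarity objective and an exponential-moving-average target encoder as a stand-in for the mutual-information maximization described above. All module names, the augmentation, and the hyperparameters are illustrative assumptions, not the reference implementation; the point is only that the auxiliary loss aligns augmented and non-augmented latents while the RL loss is computed on clean observations through the same encoder.

```python
# Sketch of a SODA-style auxiliary update (illustrative assumptions, not the
# authors' implementation): align latents of augmented and clean observations,
# while the RL loss elsewhere uses only non-augmented observations.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Tiny convolutional encoder mapping images to latent vectors."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim),
        )
    def forward(self, x):
        return self.net(x)

encoder = Encoder()                        # shared with the RL policy/critic
predictor = nn.Sequential(                 # online prediction head (assumed)
    nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
target_encoder = copy.deepcopy(encoder)    # EMA target, no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def augment(obs):
    # Placeholder augmentation (random masking); any strong augmentation
    # such as random convolution or image overlay could be substituted.
    mask = (torch.rand_like(obs[:, :1]) > 0.2).float()
    return obs * mask

def soda_update(obs):
    """One auxiliary step: match latents of augmented and clean views."""
    with torch.no_grad():
        target = F.normalize(target_encoder(obs), dim=-1)          # clean view
    pred = F.normalize(predictor(encoder(augment(obs))), dim=-1)   # augmented view
    loss = (2 - 2 * (pred * target).sum(dim=-1)).mean()            # cosine loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                                          # EMA update
        for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
            tp.mul_(0.995).add_(0.005 * p)
    return loss.item()

# The RL objective (e.g. an actor-critic loss) is computed separately on
# strictly non-augmented observations passed through the same `encoder`.
obs = torch.rand(8, 3, 84, 84)
print(soda_update(obs))
```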