Learning good feature representations is important for deep reinforcement learning (RL). However, with limited experience, RL often suffers from data inefficiency during training. For unexperienced or less-experienced trajectories (i.e., state-action sequences), the lack of data limits their use for better feature learning. In this work, we propose a novel method, dubbed PlayVirtual, which augments cycle-consistent virtual trajectories to enhance the data efficiency of RL feature representation learning. Specifically, PlayVirtual predicts future states from the current state and action with a forward dynamics model, and then predicts the previous states with a backward dynamics model, which forms a trajectory cycle. Based on this, we augment the actions to generate a large number of virtual state-action trajectories. Since these virtual trajectories are free of ground-truth state supervision, we enforce each trajectory to satisfy a cycle consistency constraint, which can significantly enhance the data efficiency. We validate the effectiveness of our designs on the Atari and DeepMind Control Suite benchmarks. Our method outperforms the current state-of-the-art methods by a large margin on both benchmarks.
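To make the forward/backward trajectory cycle concrete, the sketch below illustrates the cycle consistency idea in PyTorch: roll latent states forward along sampled virtual actions with a forward dynamics model, roll back along the same actions with a backward dynamics model, and penalize the deviation from the starting state. This is a minimal illustration under assumed MLP dynamics models; the class and function names (ForwardDynamics, BackwardDynamics, cycle_consistency_loss) and the specific loss form are illustrative, not the authors' implementation.

```python
# Minimal sketch of a trajectory cycle with a cycle consistency loss.
# Assumptions: latent states from an encoder, MLP dynamics models, and
# randomly sampled (virtual) actions; names are hypothetical.
import torch
import torch.nn as nn


class ForwardDynamics(nn.Module):
    """Predicts the next latent state from the current state and an action."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


class BackwardDynamics(nn.Module):
    """Predicts the previous latent state from the next state and the action taken."""
    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, next_state, action):
        return self.net(torch.cat([next_state, action], dim=-1))


def cycle_consistency_loss(fwd, bwd, start_state, virtual_actions):
    """Roll forward along sampled virtual actions, then roll backward along
    the same actions in reverse, and penalize the distance to the start."""
    state = start_state
    for a in virtual_actions:            # forward pass through the virtual trajectory
        state = fwd(state, a)
    for a in reversed(virtual_actions):  # backward pass through the same actions
        state = bwd(state, a)
    # No ground-truth future states are needed: only require returning to the start.
    return ((state - start_state) ** 2).mean()


if __name__ == "__main__":
    state_dim, action_dim, horizon, batch = 50, 6, 4, 32
    fwd = ForwardDynamics(state_dim, action_dim)
    bwd = BackwardDynamics(state_dim, action_dim)
    start = torch.randn(batch, state_dim)  # latent states (stand-in for encoder output)
    actions = [torch.randn(batch, action_dim) for _ in range(horizon)]  # sampled virtual actions
    loss = cycle_consistency_loss(fwd, bwd, start, actions)
    loss.backward()
    print(float(loss))
```

Because the supervision signal comes only from closing the cycle, many virtual trajectories can be generated per real transition by resampling the action sequence, which is where the data-efficiency gain comes from.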