Deep reinforcement learning (RL) algorithms suffer severe performance degradation when interaction data is scarce, which limits their real-world applicability. Recently, visual representation learning has proven effective and promising for boosting sample efficiency in RL. These methods usually rely on contrastive learning and data augmentation to train a transition model for state prediction, which differs from how the model is actually used in RL, namely for value-based planning. Accordingly, the representations learned by these visual methods may be good for recognition but suboptimal for estimating state values and solving the decision problem. To address this issue, we propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making. More specifically, VCR trains a model to predict the future state (also referred to as the ``imagined state'') from the current state and a sequence of actions. Instead of aligning this imagined state with the real state returned by the environment, VCR applies a $Q$-value head to both states and obtains two distributions over action values. A distance between these two distributions is then computed and minimized, forcing the imagined state to yield an action-value prediction similar to that of the real state. We develop two implementations of this idea for discrete and continuous action spaces, respectively. We conduct experiments on the Atari 100K and DeepMind Control Suite benchmarks to validate their effectiveness in improving sample efficiency. Our methods achieve new state-of-the-art performance among search-free RL algorithms.
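The core mechanism can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the `q_head` function, the use of a softmax to turn $Q$-values into action distributions, and the choice of KL divergence as the distance are all assumptions made here for concreteness; the paper's actual network architecture and distance measure may differ.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def vcr_loss(imagined_state, real_state, q_head):
    """Value-consistency loss (illustrative sketch).

    Rather than matching the imagined state to the real state directly,
    apply the same Q-value head to both and penalize the discrepancy
    between the two resulting action-value distributions.
    """
    # Q-values predicted from the imagined and the real state.
    q_imagined = q_head(imagined_state)
    q_real = q_head(real_state)
    # Normalize into distributions over actions (assumed design choice).
    p = softmax(q_imagined)
    q = softmax(q_real)
    # KL divergence as the distance to minimize (assumed; the paper
    # may use a different distance for discrete vs. continuous actions).
    return float(np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8))))
```

For identical states the loss is zero, and it grows as the imagined state's action-value prediction drifts from the real state's, which is the signal VCR backpropagates into the representation.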