Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms, which is computationally intensive and yields high-variance outcomes. To alleviate this issue, we design an evaluation protocol for unsupervised RL representations with lower variance and up to 600x lower computational cost. Inspired by the vision community, we propose two linear probing tasks: predicting the reward observed in a given state, and predicting the action of an expert in a given state. These two tasks are generally applicable to many RL domains, and we show through rigorous experimentation that they correlate strongly with actual downstream control performance on the Atari100k Benchmark. This provides a better method for exploring the space of pretraining algorithms without the need to run RL evaluations for every setting. Leveraging this framework, we further improve existing self-supervised learning (SSL) recipes for RL, highlighting the importance of the forward model, the size of the visual backbone, and the precise formulation of the unsupervised objective.
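To make the probing protocol concrete, the following is a minimal sketch of fitting a linear probe on top of a frozen pretrained encoder, usable for either of the two tasks (reward prediction or expert-action prediction). The encoder interface, dataset format, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal linear-probing sketch: freeze the pretrained visual backbone and train only
# a linear head to predict a per-state target (binned reward or expert action id).
# All names (encoder, dataset, dimensions) are hypothetical placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


def train_linear_probe(encoder, dataset, num_classes, epochs=10, lr=1e-3, device="cpu"):
    """Fit a linear head on frozen representations and return its accuracy."""
    encoder.eval()  # freeze the pretrained visual backbone
    for p in encoder.parameters():
        p.requires_grad_(False)

    # Infer the representation size from a single observation.
    sample_obs, _ = dataset[0]
    with torch.no_grad():
        feat_dim = encoder(sample_obs.unsqueeze(0).to(device)).shape[-1]

    probe = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=256, shuffle=True)

    for _ in range(epochs):
        for obs, target in loader:  # target: binned reward OR expert action id
            obs, target = obs.to(device), target.to(device)
            with torch.no_grad():
                z = encoder(obs)  # frozen features
            loss = nn.functional.cross_entropy(probe(z), target)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Report probe accuracy (in practice, on a held-out split).
    correct = total = 0
    with torch.no_grad():
        for obs, target in loader:
            pred = probe(encoder(obs.to(device))).argmax(dim=-1).cpu()
            correct += (pred == target).sum().item()
            total += target.numel()
    return correct / total


# Hypothetical usage: probe accuracy serves as a cheap proxy for downstream RL performance.
# reward_acc = train_linear_probe(pretrained_encoder, reward_dataset, num_classes=3)
# action_acc = train_linear_probe(pretrained_encoder, expert_action_dataset, num_classes=num_actions)
```

Because only the linear head is trained, each probe runs in minutes rather than the hours or days needed for a full RL evaluation, which is the source of the claimed reduction in computational cost and variance.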