A highly desirable property of a reinforcement learning (RL) agent -- and a major difficulty for deep RL approaches -- is the ability to generalize policies learned on a few tasks over a high-dimensional observation space to similar tasks not seen during training. Many promising approaches to this challenge consider RL as a process of training two functions simultaneously: a complex nonlinear encoder that maps high-dimensional observations to a latent representation space, and a simple linear policy over this space. We posit that a superior encoder for zero-shot generalization in RL can be trained using solely an auxiliary SSL objective, provided the training process encourages the encoder to map behaviorally similar observations to similar representations, since reward-based signals can cause the encoder to overfit (Raileanu et al., 2021). We propose Cross-Trajectory Representation Learning (CTRL), a method that runs within an RL agent and conditions its encoder to recognize behavioral similarity in observations by applying a novel SSL objective to pairs of trajectories from the agent's policies. CTRL can be viewed as having the same effect as inducing a pseudo-bisimulation metric but, crucially, avoids the use of rewards and the associated overfitting risks. Our experiments ablate various components of CTRL and demonstrate that, in combination with PPO, it achieves better generalization performance on the challenging Procgen benchmark suite (Cobbe et al., 2020).