In many scenarios, observations from more than one sensor modality are available for reinforcement learning (RL). For example, many agents can perceive their internal state via proprioceptive sensors but must infer the environment's state from high-dimensional observations such as images. For image-based RL, a variety of self-supervised representation learning approaches exist that improve performance and reduce sample complexity. These approaches learn the image representation in isolation. However, including proprioception can help representation learning algorithms focus on relevant aspects and guide them toward finding better representations. Hence, in this work, we propose using Recurrent State Space Models to fuse all available sensory information into a single, consistent representation. We combine reconstruction-based and contrastive approaches for training, which allows us to use the most appropriate method for each sensor modality. For example, we can use reconstruction for proprioception and a contrastive loss for images. We demonstrate the benefits of utilizing proprioception for representation learning in RL in a large set of experiments. Furthermore, we show that our joint representations significantly improve performance compared to a post hoc combination of image representations and proprioception.
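To make the described training scheme concrete, the following is a minimal sketch of how a per-step objective combining both loss types could look. The notation is ours and purely illustrative (the paper's exact objective, weightings, and contrastive estimator may differ): the proprioceptive observation $o_t^{\text{prop}}$ enters through a reconstruction log-likelihood, the image observation $o_t^{\text{img}}$ through a contrastive (InfoNCE-style) term, and a KL term regularizes the latent posterior toward the learned prior of the Recurrent State Space Model.
$$
\mathcal{L} = \sum_t \mathbb{E}_{q}\Big[\, \log p\big(o_t^{\text{prop}} \mid z_t\big) \;+\; \mathcal{L}^{\text{NCE}}\big(o_t^{\text{img}}, z_t\big) \Big] \;-\; \beta\, \mathrm{KL}\Big( q\big(z_t \mid z_{t-1}, a_{t-1}, o_t\big) \,\big\|\, p\big(z_t \mid z_{t-1}, a_{t-1}\big) \Big)
$$
Here $z_t$ denotes the latent state, $a_{t-1}$ the previous action, and $\beta$ a hypothetical weighting coefficient; the key point is that each modality can be attached to the shared latent state through whichever loss suits it best.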