Learning state representations enables robotic planning directly from raw observations such as images. Most methods learn state representations using losses based on reconstructing the raw observations from a lower-dimensional latent space. The similarity between observations in the space of images is often assumed and used as a proxy for estimating similarity between the underlying states of the system. However, observations commonly contain task-irrelevant factors of variation, such as varying lighting and different camera viewpoints, which are nonetheless important for reconstruction. In this work, we define relevant evaluation metrics and perform a thorough study of different loss functions for state representation learning. We show that models exploiting task priors, such as Siamese networks with a simple contrastive loss, outperform reconstruction-based representations in visual task planning.
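To make the contrastive alternative concrete, below is a minimal sketch of a Siamese-style pairwise contrastive loss of the kind referenced above, written in PyTorch. The function name, the margin value, and the use of PyTorch are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, same_state, margin=1.0):
    """Pairwise contrastive loss on latent embeddings from a Siamese encoder.

    z1, z2:      latent codes of two observations (shape [batch, dim])
    same_state:  1.0 if the two observations come from the same underlying
                 task-relevant state, 0.0 otherwise (illustrative labeling)
    """
    d = F.pairwise_distance(z1, z2)                       # Euclidean distance in latent space
    pos = same_state * d.pow(2)                           # pull same-state pairs together
    neg = (1.0 - same_state) * F.relu(margin - d).pow(2)  # push different-state pairs beyond the margin
    return (pos + neg).mean()
```

In this sketch the loss ignores reconstruction entirely: it only asks that observations sharing the same task-relevant state map close together in the latent space, while pairs from different states are pushed at least `margin` apart, which is one way task priors can be injected without modeling task-irrelevant image details.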