Robotic tasks such as manipulation with visual inputs require image features that capture the physical properties of the scene, e.g., the position and configuration of objects. Recently, it has been suggested to learn such features in an unsupervised manner from simulated, self-supervised robot interaction; the idea being that high-level physical properties are well captured by modern physical simulators, and their representation from visual inputs may transfer well to the real world. In particular, learning methods based on noise contrastive estimation have shown promising results. To robustify the simulation-to-real transfer, domain randomization (DR) was suggested for learning features that are invariant to irrelevant visual properties such as textures or lighting. In this work, however, we show that a naive application of DR to unsupervised learning based on contrastive estimation does not promote invariance, as the loss function maximizes the mutual information between the features and both the relevant and irrelevant visual properties. We propose a simple modification of the contrastive loss to fix this, exploiting the fact that we can control the simulated randomization of visual properties. Our approach learns physical features that are significantly more robust to visual domain variation, as we demonstrate using both rigid and non-rigid objects.
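To make the failure mode and the fix concrete, here is a minimal PyTorch sketch. All names (`info_nce`, `encoder`, `renderer`, `states`) are illustrative assumptions rather than the paper's code: the first function is the standard InfoNCE objective, and the second shows one natural way to exploit control over the simulated randomization, contrasting renderings of identical simulator states under independently sampled visual randomizations. The paper's actual modification may differ in detail.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    """Standard InfoNCE loss: row i of `positives` is the positive for
    row i of `anchors`; every other row in the batch is a negative."""
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature              # (B, B) cosine-similarity logits
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def dr_invariant_loss(encoder, states, renderer, temperature=0.1):
    """Hypothetical sketch of a DR-aware contrastive loss: re-render the
    *same* physical states under two independently sampled visual
    randomizations (textures, lighting, ...) and contrast across them.
    The anchor and its positive then share physics but not appearance,
    so the objective gains nothing by encoding the randomized nuisances.
    `encoder` and `renderer` are assumed stand-ins, not the paper's API."""
    view_a = renderer(states)   # randomization draw 1
    view_b = renderer(states)   # independent randomization draw 2
    return info_nce(encoder(view_a), encoder(view_b), temperature)
```

Under the naive scheme, the anchor and its positive come from the same rendering (for example, temporally adjacent frames of one randomized episode), so texture and lighting cues still help identify the positive among the negatives; this is exactly why the loss maximizes mutual information with the irrelevant properties as well. Pairing across independent randomization draws removes that shortcut.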