A household robot should be able to navigate to target locations without requiring users to first annotate everything in their home. Current approaches to this object navigation challenge do not test on real robots and rely on expensive semantically labeled 3D meshes. In this work, our aim is an agent that builds self-supervised models of the world via exploration, much as a child might. We propose an end-to-end self-supervised embodied agent that leverages exploration to train a semantic segmentation model of 3D objects, and uses those representations to learn an object navigation policy purely from self-labeled 3D meshes. The key insight is that embodied agents can exploit location consistency as a supervision signal: by collecting images of the same location from different viewpoints and angles, the agent can apply contrastive learning to fine-tune a semantic segmentation model. In our experiments, our framework outperforms other self-supervised baselines and is competitive with supervised baselines, both in simulation and when deployed in real houses.
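The location-consistency signal described above lends itself to a short sketch: pixel features observed from different viewpoints but projecting to the same 3D location form positive pairs for contrastive learning. The PyTorch snippet below is a minimal illustration under that assumption, not the paper's actual implementation; `location_ids` is a hypothetical input standing for the 3D map cell (computed from depth and camera pose) that each pixel falls into.

```python
import torch
import torch.nn.functional as F

def location_consistency_loss(embeddings, location_ids, temperature=0.07):
    """Supervised-contrastive (InfoNCE-style) loss from location consistency.

    Pixel embeddings that project to the same 3D map cell, seen from
    different viewpoints, are treated as positives and pulled together;
    embeddings from different cells are pushed apart.

    embeddings:   (N, D) per-pixel features from the segmentation backbone
    location_ids: (N,)   integer id of the 3D cell each pixel projects to
                  (hypothetical; derived from depth + camera pose)
    """
    z = F.normalize(embeddings, dim=1)               # cosine-similarity space
    logits = z @ z.t() / temperature                 # (N, N) pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (location_ids[:, None] == location_ids[None, :]) & ~self_mask

    # Softmax over all other samples (self excluded), then average the
    # log-probability each anchor assigns to its positives.
    log_prob = logits.masked_fill(self_mask, float('-inf')).log_softmax(dim=1)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    loss = -pos_log_prob / pos_mask.sum(dim=1).clamp(min=1)
    return loss[pos_mask.any(dim=1)].mean()          # only anchors with positives
```

In practice, `embeddings` would come from reshaping a (B, D, H, W) backbone feature map to (B·H·W, D), and `location_ids` from projecting each pixel into a voxel map using the depth image and camera pose collected during exploration.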