Training effective embodied AI agents often involves manual reward engineering, expert imitation, specialized components such as maps, or leveraging additional sensors for depth and localization. Another approach is to use neural architectures alongside self-supervised objectives which encourage better representation learning. In practice, there are few guarantees that these self-supervised objectives encode task-relevant information. We propose the Scene Graph Contrastive (SGC) loss, which uses scene graphs as general-purpose, training-only, supervisory signals. The SGC loss does away with explicit graph decoding and instead uses contrastive learning to align an agent's representation with a rich graphical encoding of its environment. The SGC loss is generally applicable, simple to implement, and encourages representations that encode objects' semantics, relationships, and history. Using the SGC loss, we attain significant gains on three embodied tasks: Object Navigation, Multi-Object Navigation, and Arm Point Navigation. Finally, we present studies and analyses which demonstrate the ability of our trained representation to encode semantic cues about the environment.
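The alignment idea described above can be illustrated with a generic InfoNCE-style contrastive objective. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not give the exact form of the SGC loss, so the cosine similarity, the temperature value, and the names `contrastive_loss`, `agent_reps`, and `graph_reps` are all hypothetical. Each agent representation is treated as a positive pair with the scene-graph encoding at the same index, and every other graph encoding in the batch acts as a negative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot() gives cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(agent_reps, graph_reps, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in for the SGC loss).

    agent_reps[i] and graph_reps[i] form the positive pair; all other
    graph_reps[j] in the batch serve as negatives for agent_reps[i].
    """
    agent = [normalize(a) for a in agent_reps]
    graph = [normalize(g) for g in graph_reps]
    total = 0.0
    for i, a in enumerate(agent):
        logits = [dot(a, g) / temperature for g in graph]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matching graph encoding.
        total += log_denominator - logits[i]
    return total / len(agent)
```

With a loss of this shape, the graph encoder is needed only to compute the training signal, which is consistent with the abstract's description of scene graphs as training-only supervision.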