Object goal navigation aims to steer an agent towards a target object based on the agent's observations. Designing effective visual representations of the observed scene is pivotal for determining navigation actions. In this paper, we introduce a Visual Transformer Network (VTNet) for learning informative visual representations in navigation. VTNet is a highly effective structure that embodies two key properties for visual representations: first, the relationships among all the object instances in a scene are exploited; second, the spatial locations of objects and image regions are emphasized so that directional navigation signals can be learned. Furthermore, we develop a pre-training scheme to associate the visual representations with navigation signals and thus facilitate navigation policy learning. In a nutshell, VTNet embeds object and region features with their location cues as spatial-aware descriptors and then incorporates all the encoded descriptors through attention operations to achieve an informative representation for navigation. Given such visual representations, agents are able to explore the correlations between visual observations and navigation actions. For example, an agent would prioritize "turning right" over "turning left" when the visual representation emphasizes the right side of the activation map. Experiments in the artificial environment AI2-Thor demonstrate that VTNet significantly outperforms state-of-the-art methods in unseen testing environments.
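The idea of spatial-aware descriptors fused by attention can be illustrated with a minimal NumPy sketch. This is not the VTNet implementation: the feature sizes, the random projection matrices, the additive combination of feature and location cues, and the choice of region descriptors as queries over object descriptors are all assumptions made for illustration, and the helper names (`spatial_descriptors`, `attend`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_descriptors(features, locations, w_feat, w_loc):
    # Assumption: project features and location cues to a shared
    # dimension and sum them to obtain spatial-aware descriptors.
    return features @ w_feat + locations @ w_loc

def attend(queries, keys, values):
    # Single-head scaled dot-product attention.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ values, weights

rng = np.random.default_rng(0)
d = 16  # shared descriptor size (illustrative)

# Hypothetical inputs: 5 detected objects with normalized boxes,
# and a 7x7 grid of region features with normalized cell positions.
obj_feat = rng.normal(size=(5, 32))
obj_box = rng.uniform(size=(5, 4))
reg_feat = rng.normal(size=(49, 64))
ys, xs = np.meshgrid(np.linspace(0, 1, 7), np.linspace(0, 1, 7), indexing="ij")
reg_pos = np.stack([xs.ravel(), ys.ravel()], axis=1)

# Random matrices stand in for learned projections.
w_obj_feat = rng.normal(size=(32, d))
w_obj_loc = rng.normal(size=(4, d))
w_reg_feat = rng.normal(size=(64, d))
w_reg_loc = rng.normal(size=(2, d))

obj_desc = spatial_descriptors(obj_feat, obj_box, w_obj_feat, w_obj_loc)
reg_desc = spatial_descriptors(reg_feat, reg_pos, w_reg_feat, w_reg_loc)

# Each region descriptor gathers information from all object descriptors.
fused, weights = attend(reg_desc, obj_desc, obj_desc)
```

Here `fused` holds one attended vector per image region, and each row of `weights` shows how strongly that region attends to each object. In a trained model, a downstream policy could read directional cues from where such a representation concentrates.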