Visual navigation for autonomous agents is a core task in the fields of computer vision and robotics. Learning-based methods, such as deep reinforcement learning, have the potential to outperform the classical solutions developed for this task; however, they come at a significantly higher computational cost. In this work, we design a novel approach that aims to perform comparably to or better than existing learning-based solutions, but under a clear time/computational budget. To this end, we propose a method to encode vital scene semantics -- such as traversable paths, unexplored areas, and observed scene objects -- alongside raw visual streams such as RGB, depth, and semantic segmentation masks, into a semantically informed, top-down egocentric map representation. Further, to enable the effective use of this information, we introduce a novel 2-D map attention mechanism based on the successful multi-layer Transformer networks. We conduct experiments on 3-D reconstructed indoor PointGoal visual navigation and demonstrate the effectiveness of our approach. We show that by using our novel attention schema and auxiliary rewards to better utilize scene semantics, we outperform multiple baselines trained with only raw inputs or implicit semantic information, while operating with an 80% decrease in the agent's experience.
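To make the idea of attending over a top-down egocentric map concrete, the following is a minimal illustrative sketch (not the authors' implementation): the map, a grid of per-cell semantic feature vectors, is flattened into tokens, and a query vector (e.g. an agent state embedding) attends over the cells via Transformer-style scaled dot-product attention. All names, shapes, and the single-query/single-head simplification are assumptions for illustration.

```python
import math

def map_attention(grid, query):
    """Hypothetical sketch of 2-D map attention.

    grid:  H x W nested lists, each cell a length-C feature vector
           (e.g. traversability, exploration, object channels).
    query: length-C vector, e.g. an agent state embedding.
    Returns a length-C context vector pooled over all map cells.
    """
    # Flatten the 2-D map into a sequence of H*W tokens.
    tokens = [cell for row in grid for cell in row]
    d = len(query)
    # Scaled dot-product attention scores, one per map cell.
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    # Numerically stable softmax over the cells.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Attention-weighted sum of cell features: the map context vector.
    return [sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(d)]

# Tiny example: a 2x2 map with 2 semantic channels per cell.
grid = [[[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 1.0], [0.0, 0.0]]]
ctx = map_attention(grid, [1.0, 0.0])
```

A full multi-layer Transformer would add learned query/key/value projections, multiple heads, and feed-forward sublayers; the sketch only shows the core attention pooling over map cells.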