Can an autonomous agent navigate in a new environment without building an explicit map? For the task of PointGoal navigation ('Go to $\Delta x$, $\Delta y$') under idealized settings (no RGB-D or actuation noise, perfect GPS+Compass), the answer is a clear 'yes': map-less neural models composed of task-agnostic components (CNNs and RNNs) trained with large-scale reinforcement learning achieve 100% Success on a standard dataset (Gibson). However, for PointNav in a realistic setting (RGB-D and actuation noise, no GPS+Compass), this is an open question, and it is the one we tackle in this paper. The strongest published result for this task is 71.7% Success. First, we identify the main (perhaps the only) cause of the drop in performance: the absence of GPS+Compass. An agent with perfect GPS+Compass faced with RGB-D sensing and actuation noise achieves 99.8% Success (Gibson-v2 val). This suggests that (to paraphrase a meme) robust visual odometry is all we need for realistic PointNav; if we can achieve that, we can ignore the sensing and actuation noise. With that as our operating hypothesis, we scale the dataset and model size, and develop human-annotation-free data-augmentation techniques to train models for visual odometry. We advance the state of the art on the Habitat Realistic PointNav Challenge from 71% to 94% Success (+23, 31% relative) and from 53% to 74% SPL (+21, 40% relative). While our approach does not saturate or 'solve' this dataset, this strong improvement, combined with promising zero-shot sim2real transfer (to a LoCoBot), provides evidence consistent with the hypothesis that explicit mapping may not be necessary for navigation, even in a realistic setting.
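To make the role of visual odometry concrete, the sketch below shows one way an egomotion estimate could stand in for GPS+Compass: after each step, the goal vector ($\Delta x$, $\Delta y$) is re-expressed in the agent's new egocentric frame. This is an illustrative sketch, not the implementation used in this work. The function name `update_goal`, the frame convention (x forward, y left, counterclockwise-positive rotation), and the `(tx, ty, dtheta)` egomotion format are all assumptions made for the example.

```python
import math


def update_goal(goal, ego_motion):
    """Re-express a 2D goal in the agent's frame after one step.

    Hypothetical helper, assuming:
      goal       = (gx, gy): goal in the agent's previous frame
      ego_motion = (tx, ty, dtheta): estimated translation (in the
                   previous frame) and counterclockwise rotation in
                   radians, e.g. predicted by a visual odometry model
    """
    gx, gy = goal
    tx, ty, dtheta = ego_motion

    # Shift the origin to the agent's new position.
    sx, sy = gx - tx, gy - ty

    # Rotate by -dtheta to account for the agent's new heading.
    c, s = math.cos(dtheta), math.sin(dtheta)
    return (c * sx + s * sy, -s * sx + c * sy)


# Goal starts 2 m straight ahead; the agent moves forward 0.25 m,
# then turns left by 90 degrees in place.
goal = (2.0, 0.0)
goal = update_goal(goal, (0.25, 0.0, 0.0))
goal = update_goal(goal, (0.0, 0.0, math.pi / 2))
```

Under these conventions the goal ends up 1.75 m to the agent's right. Any error in the per-step estimate is carried into the updated goal and accumulates over an episode, which is why the accuracy of the odometry model matters so much in this setting.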