Can an autonomous agent navigate in a new environment without building an explicit map? For the task of PointGoal navigation ('Go to $\Delta x$, $\Delta y$') under idealized settings (no RGB-D or actuation noise, perfect GPS+Compass), the answer is a clear 'yes': map-less neural models composed of task-agnostic components (CNNs and RNNs) trained with large-scale reinforcement learning achieve 100% Success on a standard dataset (Gibson). However, for PointNav in a realistic setting (RGB-D and actuation noise, no GPS+Compass), this remains an open question, one we tackle in this paper. The strongest published result for this task is 71.7% Success. First, we identify the main (perhaps, only) cause of the drop in performance: the absence of GPS+Compass. An agent with perfect GPS+Compass, faced with RGB-D sensing and actuation noise, achieves 99.8% Success (Gibson-v2 val). This suggests that (to paraphrase a meme) robust visual odometry is all we need for realistic PointNav; if we can achieve that, we can ignore the sensing and actuation noise. With that as our operating hypothesis, we scale the dataset and model size, and develop human-annotation-free data-augmentation techniques to train models for visual odometry. We advance the state of the art on the Habitat Realistic PointNav Challenge from 71% to 94% Success (+32.4% relative) and 53% to 74% SPL (+39.6% relative). While our approach does not saturate or 'solve' this dataset, this strong improvement combined with promising zero-shot sim2real transfer (to a LoCoBot) provides evidence consistent with the hypothesis that explicit mapping may not be necessary for navigation, even in a realistic setting.
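The core idea that visual odometry can stand in for GPS+Compass amounts to dead reckoning: per-step egocentric pose deltas predicted by the odometry model are composed into a global SE(2) pose estimate. A minimal sketch of this integration step is below; the function name and delta format `(dx, dy, dtheta)` are illustrative assumptions, not interfaces from the paper or from Habitat.

```python
import math

def integrate_odometry(deltas):
    """Accumulate per-step egocentric pose deltas (dx, dy, dtheta)
    into a global SE(2) pose via dead reckoning, emulating the
    GPS+Compass signal an idealized agent would receive.

    NOTE: illustrative sketch only; a learned visual-odometry model
    would supply the deltas from consecutive RGB-D observations.
    """
    x, y, theta = 0.0, 0.0, 0.0
    for dx, dy, dtheta in deltas:
        # Rotate the egocentric translation into the world frame,
        # then translate; finally update the heading.
        x += dx * math.cos(theta) - dy * math.sin(theta)
        y += dx * math.sin(theta) + dy * math.cos(theta)
        theta = (theta + dtheta) % (2 * math.pi)
    return x, y, theta
```

Because errors in each predicted delta compound over the trajectory, the integrated pose drifts, which is why the paper emphasizes making the odometry model robust to sensing and actuation noise.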