Language understanding is essential for the navigation agent to follow instructions. We observe two kinds of issues in the instructions that make the navigation task challenging: 1. The mentioned landmarks are not recognizable by the navigation agent due to the different vision abilities of the instructor and the modeled agent. 2. The mentioned landmarks apply to multiple targets and are thus not distinctive enough for selecting the target among the candidate viewpoints. To deal with these issues, we design a translator module for the navigation agent that converts the original instructions into easy-to-follow sub-instruction representations at each step. The translator needs to focus on the recognizable and distinctive landmarks based on the agent's visual abilities and the observed visual environment. To achieve this goal, we create a new synthetic sub-instruction dataset and design specific tasks to train the translator and the navigation agent. We evaluate our approach on the Room2Room~(R2R), Room4Room~(R4R), and Room2Room Last~(R2R-Last) datasets and achieve state-of-the-art results on multiple benchmarks.
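To make the two selection criteria concrete, the sketch below illustrates, under simplifying assumptions, how a landmark mention could be kept only if it is recognizable (detected in the agent's current observation) and distinctive (matching exactly one candidate viewpoint). This is a minimal illustrative sketch, not the paper's implementation: the function and variable names (\texttt{translate}, \texttt{observed}, \texttt{candidates}) are hypothetical, and the actual translator is learned over instruction and visual representations rather than symbolic object labels.
\begin{verbatim}
# Minimal sketch (hypothetical names): keep a landmark mention only if it is
# (1) recognizable: present in the agent's current visual observation, and
# (2) distinctive: detected in exactly one candidate viewpoint.
from typing import Dict, List, Set

def translate(mentions: List[str],
              observed: Set[str],
              candidates: Dict[str, Set[str]]) -> List[str]:
    """Filter instruction landmarks into an easy-to-follow sub-instruction."""
    kept = []
    for m in mentions:
        recognizable = m in observed
        distinctive = sum(m in objs for objs in candidates.values()) == 1
        if recognizable and distinctive:
            kept.append(m)
    return kept

# Toy usage: "sofa" is observed and appears in only one candidate view, so it
# is kept; "door" appears in two candidate views and is dropped as
# non-distinctive; "piano" is not observed at all and is also dropped.
mentions = ["sofa", "door", "piano"]
observed = {"sofa", "door"}
candidates = {"view_a": {"sofa", "door"}, "view_b": {"door", "table"}}
print(translate(mentions, observed, candidates))  # -> ['sofa']
\end{verbatim}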