Vision-and-Language Navigation (VLN) requires an agent to find a path to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas. Most existing methods take the words in the instructions and the discrete views of each panorama as the minimal units of encoding. However, this requires a model to match different nouns (e.g., TV, table) against the same input view feature. In this work, we propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level, namely objects and words. Our sequential BERT also enables the visual-textual clues to be interpreted in light of the temporal context, which is crucial to multi-round VLN tasks. Additionally, we enable the model to identify the relative direction (e.g., left/right/front/back) of each navigable location and the room type (e.g., bedroom, kitchen) of its current and final navigation goals, as such information is frequently mentioned in instructions to imply the desired next and final locations. We thus enable the model to know where the relevant objects lie in its views and where it stands in the scene. Extensive experiments on three indoor VLN tasks, REVERIE, NDH, and R2R, demonstrate the effectiveness of our method compared against several state-of-the-art approaches. Project repository: https://github.com/YuankaiQi/ORIST
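To make the described encoding concrete, the following is a minimal, illustrative sketch (not the authors' implementation; see the project repository for that) of the core idea: object-region features and instruction word tokens are projected into a shared space and encoded as one joint sequence by a BERT-style encoder, with auxiliary heads predicting the relative direction of candidate locations and the room type of the current/target room. All module names, dimensions, and label counts below are assumptions for illustration only.

```python
# Hypothetical sketch of an object-and-word sequential encoder with
# direction and room-type auxiliary heads. Not the ORIST codebase.
import torch
import torch.nn as nn


class ObjectWordSequentialEncoder(nn.Module):
    def __init__(self, vocab_size=30522, obj_feat_dim=2048, hidden=768,
                 num_layers=4, num_directions=4, num_room_types=12):
        super().__init__()
        # Word tokens and object-region features share one embedding space.
        self.word_embed = nn.Embedding(vocab_size, hidden)
        self.obj_proj = nn.Linear(obj_feat_dim, hidden)
        # Type embeddings distinguish the two modalities in the joint sequence.
        self.type_embed = nn.Embedding(2, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Auxiliary heads: relative direction of each visual token (in the paper
        # this applies to navigable candidates) and room type of current/goal room.
        self.direction_head = nn.Linear(hidden, num_directions)
        self.room_head = nn.Linear(hidden, num_room_types)

    def forward(self, word_ids, obj_feats, state=None):
        # word_ids: (B, Lw) token ids; obj_feats: (B, Lo, obj_feat_dim)
        words = self.word_embed(word_ids) + self.type_embed.weight[0]
        objs = self.obj_proj(obj_feats) + self.type_embed.weight[1]
        seq = torch.cat([words, objs], dim=1)       # joint word+object sequence
        if state is not None:                       # optional state carried across steps
            seq = torch.cat([state.unsqueeze(1), seq], dim=1)
        ctx = self.encoder(seq)                     # (B, L, hidden)
        pooled = ctx.mean(dim=1)                    # crude pooled summary
        return {
            "context": ctx,
            "state": pooled,                        # reused at the next time step
            "direction_logits": self.direction_head(ctx[:, -obj_feats.size(1):]),
            "room_logits": self.room_head(pooled),
        }


if __name__ == "__main__":
    model = ObjectWordSequentialEncoder()
    out = model(torch.randint(0, 30522, (2, 20)), torch.randn(2, 8, 2048))
    print(out["direction_logits"].shape, out["room_logits"].shape)
```

Returning the pooled state and feeding it back at the next step is one simple way to realize the temporal (sequential) context mentioned above; the actual model may handle recurrence differently.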