Vision-and-Language Navigation (VLN) is a challenging task in the field of artificial intelligence. Although substantial progress has been made on this task over the past few years, driven by breakthroughs in deep vision and language models, it remains difficult to build VLN models that generalize as well as humans do. In this paper, we offer a new perspective on improving VLN models. Based on our observation that snapshots of the same VLN model behave markedly differently even when their success rates are nearly identical, we propose a snapshot-based ensemble that combines the predictions of multiple snapshots. Built on snapshots of the existing state-of-the-art (SOTA) model $\circlearrowright$BERT and of our past-action-aware modification, the proposed ensemble achieves new SOTA performance on the R2R dataset challenge in Navigation Error (NE) and Success weighted by Path Length (SPL).
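For illustration, the sketch below shows one simple way predictions from multiple snapshots can be combined at each navigation step; logit averaging is used here as a stand-in, and all names and interfaces are hypothetical rather than the paper's actual API, which may use a different combination rule.

```python
import torch

# Minimal sketch of snapshot-based ensembling for VLN action selection.
# Assumption: `snapshots` is a list of trained model snapshots, each a
# callable mapping the current observation to a tensor of per-action logits.
@torch.no_grad()
def ensemble_action(snapshots, observation):
    """Average action logits across snapshots, then pick the argmax action."""
    logits = torch.stack([model(observation) for model in snapshots])  # (S, A)
    mean_logits = logits.mean(dim=0)                                   # (A,)
    return int(mean_logits.argmax())
```

Other combination rules, such as majority voting over the snapshots' argmax actions, fit the same interface.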