NavQ: Learning a Q-Model for Foresighted Vision-and-Language Navigation

This work addresses goal-oriented Vision-and-Language Navigation (VLN). Existing methods typically make decisions from historical information alone, overlooking the future implications and long-term outcomes of their actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw on Q-learning to train a Q-model on large-scale unlabeled trajectory data, so that it learns general knowledge about the layout and object relations of indoor scenes. For each candidate action, the model generates a Q-feature, analogous to the Q-value in a traditional Q-network, that describes the future information likely to be observed after taking that action. A cross-modal future encoder then integrates the task-agnostic Q-feature with the navigation instruction to produce a set of action scores reflecting future prospects. Combined with the original history-based scores, these scores drive an A*-style search strategy that effectively explores the regions more likely to lead to the destination. Extensive experiments on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
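To make the score-combination idea concrete, the following is a minimal sketch, not the paper's implementation. It assumes the history-based and future-based scores are fused by a weighted sum and that the agent always expands the highest-scoring unvisited node in a toy room graph (a simplified best-first search standing in for the A*-style strategy). The function name `best_first_explore`, the `future_weight` parameter, the room graph, and all score values are hypothetical.

```python
import heapq
from typing import Dict, Hashable, List

Node = Hashable


def best_first_explore(
    graph: Dict[Node, List[Node]],
    start: Node,
    goal: Node,
    history_score: Dict[Node, float],
    future_score: Dict[Node, float],
    future_weight: float = 1.0,
    max_steps: int = 50,
) -> List[Node]:
    """Return the order in which nodes are expanded until the goal is reached."""

    def combined(node: Node) -> float:
        # Assumed fusion rule: weighted sum of the two scores.
        return history_score[node] + future_weight * future_score[node]

    # heapq is a min-heap, so scores are negated to pop the best node first.
    # The counter breaks ties deterministically without comparing nodes.
    frontier = [(-combined(start), 0, start)]
    counter = 1
    visited = set()
    order: List[Node] = []

    while frontier and len(order) < max_steps:
        _, _, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        if node == goal:
            break
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(frontier, (-combined(neighbor), counter, neighbor))
                counter += 1
    return order


# Hypothetical toy scene: the goal is the bathroom, reachable via the bedroom.
graph = {
    "hall": ["kitchen", "bedroom"],
    "kitchen": ["pantry"],
    "bedroom": ["bathroom"],
    "pantry": [],
    "bathroom": [],
}
# Made-up scores: history alone slightly prefers the kitchen,
# while the future score anticipates a bathroom beyond the bedroom.
history = {"hall": 0.0, "kitchen": 0.6, "bedroom": 0.5, "pantry": 0.1, "bathroom": 0.9}
future = {"hall": 0.0, "kitchen": 0.1, "bedroom": 0.9, "pantry": 0.0, "bathroom": 0.9}

history_only = best_first_explore(graph, "hall", "bathroom", history, future, future_weight=0.0)
with_future = best_first_explore(graph, "hall", "bathroom", history, future, future_weight=1.0)
print(history_only)
print(with_future)
```

In this toy setting, the history-only agent detours through the kitchen before reaching the goal, whereas adding the future score sends it through the bedroom directly. The weighted sum is only one plausible way to combine the two sets of scores.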
