With the emergence of varied visual navigation tasks (e.g., image-goal, object-goal, audio-goal, and vision-language navigation) that specify the target in different ways, the community has made notable advances in training specialized agents, each capable of handling an individual navigation task well. Given the abundance of embodied navigation tasks and task-specific solutions, we address a more fundamental question: can we learn a single powerful agent that masters not one but multiple navigation tasks concurrently? First, we propose VXN, a large-scale 3D dataset that instantiates four classic navigation tasks in standardized, continuous, and audiovisual-rich environments. Second, we propose Vienna, a versatile embodied navigation agent that simultaneously learns to perform the four navigation tasks with one model. Built upon a fully attentive architecture, Vienna formulates the various navigation tasks as a unified parse-and-query procedure: the target description, augmented with four task embeddings, is interpreted into a set of diversified goal vectors, which are refined as navigation progresses and used as queries to retrieve supportive context from the episodic history for decision making. This enables the reuse of knowledge across navigation tasks with varying input domains and modalities. We empirically demonstrate that, compared with learning each visual navigation task individually, our multitask agent achieves comparable or even better performance with reduced complexity.
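To make the parse-and-query idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "parsing" can be approximated by a set of goal slots attending over the task-tagged target description, and that "querying" is scaled dot-product attention from the goal vectors over the episodic history. The class name `ParseAndQuerySketch`, the task labels in `TASKS`, the dimensions, the residual refinement rule, the mean-pooled action head, and the random weights are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import numpy as np

# Hypothetical labels for the four tasks named in the abstract.
TASKS = ("image_goal", "object_goal", "audio_goal", "vln")


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    """Scaled dot-product attention: (Q, D), (N, D), (N, D) -> (Q, D)."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values


class ParseAndQuerySketch:
    """Illustrative stand-in for a unified multitask navigation agent."""

    def __init__(self, dim=32, num_goals=4, num_actions=4, seed=0):
        rng = np.random.default_rng(seed)
        # One embedding per task (random here; learned in a real model).
        self.task_emb = rng.normal(size=(len(TASKS), dim))
        # Goal slots that get filled in from the target description.
        self.goal_slots = rng.normal(size=(num_goals, dim))
        # Linear head mapping pooled goal vectors to action logits.
        self.action_head = rng.normal(size=(dim, num_actions)) / np.sqrt(dim)

    def parse(self, target_tokens, task):
        """Parse: turn a task-tagged target description into goal vectors."""
        # Assumption: only the active task's embedding is added to each token.
        tagged = target_tokens + self.task_emb[TASKS.index(task)]
        return attend(self.goal_slots, tagged, tagged)

    def step(self, goals, history):
        """Query: use goals to retrieve context from history, then act."""
        context = attend(goals, history, history)
        refined = goals + context  # assumed residual refinement of the goals
        logits = refined.mean(axis=0) @ self.action_head
        return refined, logits


# Usage with random stand-in features (no real encoder is involved).
agent = ParseAndQuerySketch()
rng = np.random.default_rng(1)
target = rng.normal(size=(5, 32))    # e.g. 5 encoded target tokens
history = rng.normal(size=(10, 32))  # e.g. 10 encoded past observations
goals = agent.parse(target, "object_goal")
goals, logits = agent.step(goals, history)
action = int(np.argmax(logits))
```

The point of the sketch is that the same `parse` and `step` code paths serve every task: only the task embedding and the encoder that produces `target_tokens` would differ, which is one plausible reading of how knowledge could be shared across input modalities.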