Recent years have seen embodied visual navigation advance in two distinct directions: (i) equipping the AI agent to follow natural language instructions, and (ii) making the navigable world multimodal, e.g., audio-visual navigation. However, the real world is not only multimodal but also often complex, and thus, in spite of these advances, agents still need to understand the uncertainty in their actions and seek instructions to navigate. To this end, we present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation. As in audio-visual navigation tasks, the goal of our embodied agent is to localize an audio event by navigating the 3D visual world; however, the agent may also seek help from a human (oracle), where the assistance is provided in free-form natural language. To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone that learns: (a) a high-level policy to choose either audio cues for navigation or to query the oracle, and (b) lower-level policies to select navigation actions based on its audio-visual and language inputs. The policies are trained by rewarding success on the navigation task while minimizing the number of queries to the oracle. To empirically evaluate AVLEN, we present experiments on the SoundSpaces framework for semantic audio-visual navigation tasks. Our results show that equipping the agent to ask for help leads to a clear improvement in performance, especially in challenging cases, e.g., when the sound is unheard during training or in the presence of distractor sounds.
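The hierarchical decision structure described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all function names, thresholds, and reward values (`QUERY_PENALTY`, `SUCCESS_REWARD`, the uncertainty cutoff) are assumptions made for exposition. It shows the two levels -- a high-level policy choosing between following audio cues and querying the oracle, and low-level policies mapping the chosen modality to a navigation action -- together with reward shaping that penalizes queries.

```python
# Hypothetical sketch of a two-level (hierarchical) navigation policy in the
# spirit of AVLEN. Names and numbers are illustrative assumptions only.

HIGH_LEVEL_OPTIONS = ("audio_goal", "query_oracle")
ACTIONS = ("move_forward", "turn_left", "turn_right", "stop")

QUERY_PENALTY = -0.2   # assumed shaping term: discourage excessive queries
SUCCESS_REWARD = 10.0  # assumed reward for reaching the sounding object


def high_level_policy(uncertainty: float, query_budget: int) -> str:
    """Choose a modality: query the oracle only when the agent is
    sufficiently uncertain and still has query budget left."""
    if uncertainty > 0.7 and query_budget > 0:
        return "query_oracle"
    return "audio_goal"


def low_level_policy(option: str, observation: dict) -> str:
    """Select a navigation action conditioned on the chosen modality."""
    if option == "query_oracle":
        # Language-conditioned sub-policy: follow the oracle's instruction.
        return observation["instruction_action"]
    # Audio-goal sub-policy: follow audio-visual cues toward the source.
    return observation["audio_action"]


def step_reward(option: str, reached_goal: bool) -> float:
    """Reward success on the task while charging a cost per oracle query."""
    reward = SUCCESS_REWARD if reached_goal else 0.0
    if option == "query_oracle":
        reward += QUERY_PENALTY
    return reward
```

In this sketch, the query penalty realizes the paper's trade-off of "success on the navigation task while minimizing the number of queries": an agent that asks at every step forfeits reward, so it learns to reserve queries for high-uncertainty states.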