Generalisation to unseen contexts remains a challenge for embodied navigation agents. In semantic audio-visual navigation (SAVi) tasks, generalisation should cover both unseen indoor visual scenes and unheard sounding objects. However, previous SAVi task definitions do not include evaluation conditions on truly novel sounding objects, resorting instead to evaluating agents on unheard sound clips of known objects; meanwhile, previous SAVi methods do not include explicit mechanisms for incorporating domain knowledge about object and region semantics. These weaknesses limit the development and assessment of models' abilities to generalise their learned experience. In this work, we introduce knowledge-driven scene priors into the semantic audio-visual embodied navigation task: we combine semantic information from our novel knowledge graph that encodes object-region relations, spatial knowledge from dual Graph Encoder Networks, and background knowledge from a series of pre-training tasks, all within a reinforcement learning framework for audio-visual navigation. We also define a new audio-visual navigation sub-task, where agents are evaluated on novel sounding objects, as opposed to unheard clips of known objects. We show improvements over strong baselines in generalisation to unseen regions and novel sounding objects, within the Habitat-Matterport3D simulation environment, under the SoundSpaces task.