Egocentric Audio-Visual Object Localization

Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines can move closer to human intelligence by learning from multisensory inputs captured from an egocentric perspective. In this paper, we explore the challenging task of egocentric audio-visual object localization and observe that 1) egomotion is common in first-person recordings, even within a short duration; and 2) out-of-view sound components can arise when wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module that handles egomotion explicitly: it estimates the temporal geometric transformation and uses it to update visual representations, mitigating the effect of egomotion. To tackle the second problem, we propose a cascaded feature enhancement module, which improves cross-modal localization robustness by disentangling visually indicated audio representations. During training, we use the naturally available audio-visual temporal synchronization as "free" self-supervision, avoiding costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and generalizes to diverse audio-visual scenes.
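The idea of estimating a geometric transformation between frames and using it to update visual representations can be illustrated with a small sketch. It is not the paper's module, which is learned and whose details the abstract does not give. The sketch assumes the transformation is already known as a 3x3 homography, uses nearest-neighbour sampling, and aggregates by plain averaging. The names `warp_grid` and `aggregate` are hypothetical.

```python
from typing import List, Sequence

Grid = List[List[float]]  # one feature channel laid out as rows of values


def warp_grid(feat: Grid, H: Sequence[Sequence[float]], fill: float = 0.0) -> Grid:
    """Align a neighbouring frame's feature grid to the reference frame.

    Assumption: H maps reference coordinates (x, y, 1) to coordinates in
    `feat` (inverse warping). Sampling is nearest-neighbour; locations that
    fall outside `feat` receive `fill`.
    """
    h, w = len(feat), len(feat[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            u = H[0][0] * x + H[0][1] * y + H[0][2]
            v = H[1][0] * x + H[1][1] * y + H[1][2]
            s = H[2][0] * x + H[2][1] * y + H[2][2]
            if abs(s) < 1e-12:
                continue  # point at infinity, keep the fill value
            sx, sy = round(u / s), round(v / s)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = feat[sy][sx]
    return out


def aggregate(ref: Grid, neighbours: Sequence[Grid],
              homographies: Sequence[Sequence[Sequence[float]]]) -> Grid:
    """Average the reference grid with its egomotion-compensated neighbours."""
    aligned = [ref] + [warp_grid(f, H) for f, H in zip(neighbours, homographies)]
    h, w = len(ref), len(ref[0])
    n = len(aligned)
    return [[sum(g[y][x] for g in aligned) / n for x in range(w)] for y in range(h)]


# Toy example: the object sits at (x=1, y=1) in the reference frame and
# appears one pixel to the right in the neighbouring frame.
ref = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
neighbour = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
H = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]  # x' = x + 1, y' = y
fused = aggregate(ref, [neighbour], [H])
```

Without the warp, averaging the two toy frames would split the response across two locations. With it, the response stays concentrated at the reference location. Filling out-of-view positions with zeros dilutes the average near the borders, so a real system would more likely mask or weight those positions.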

