Most recent work on visual sound source localization relies on semantic audio-visual representations learned in a self-supervised manner and, by design, excludes the temporal information present in videos. While this approach proves effective on widely used benchmark datasets, it falls short in challenging scenarios such as urban traffic. This work introduces temporal context into state-of-the-art methods for sound source localization in urban scenes, using optical flow to encode motion information. An analysis of the strengths and weaknesses of our methods helps us better understand the problem of visual sound source localization and sheds light on open challenges for audio-visual scene understanding.
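To make the idea of optical flow as a motion encoding concrete, the sketch below estimates a coarse flow field between two frames via simple block matching and reduces it to a per-block motion-magnitude map. This is an illustrative toy, not the method used in the paper (which would typically rely on a dedicated optical-flow estimator); the function name and parameters are our own. Note that in textureless regions the SAD cost is ambiguous and the returned displacement is arbitrary.

```python
import numpy as np

def block_matching_flow(prev, curr, block=8, search=4):
    """Estimate a coarse optical-flow field by block matching.

    For each `block`-sized patch in `prev`, find the displacement
    within a +/- `search` window that minimizes the sum of absolute
    differences (SAD) against `curr`. Returns an array of shape
    (H // block, W // block, 2) holding (dx, dy) per block.
    """
    h, w = prev.shape
    fh, fw = h // block, w // block
    flow = np.zeros((fh, fw, 2), dtype=np.float32)
    for by in range(fh):
        for bx in range(fw):
            y0, x0 = by * block, bx * block
            ref = prev[y0:y0 + block, x0:x0 + block].astype(np.int32)
            best, best_dy, best_dx = np.inf, 0, 0
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    # Skip candidate windows that fall outside the frame.
                    if y1 < 0 or x1 < 0 or y1 + block > h or x1 + block > w:
                        continue
                    cand = curr[y1:y1 + block, x1:x1 + block].astype(np.int32)
                    sad = np.abs(ref - cand).sum()
                    if sad < best:
                        best, best_dy, best_dx = sad, dy, dx
            flow[by, bx] = (best_dx, best_dy)
    return flow

# Toy example: a bright square shifts 2 px to the right between frames.
prev = np.zeros((32, 32), dtype=np.uint8)
curr = np.zeros((32, 32), dtype=np.uint8)
prev[8:16, 8:16] = 255
curr[8:16, 10:18] = 255

flow = block_matching_flow(prev, curr)
# Per-block motion energy; a moving sound source lights up here even
# when a single frame gives no cue about which object is active.
magnitude = np.linalg.norm(flow, axis=-1)
```

The block covering the square recovers the displacement (dx=2, dy=0), so its motion-magnitude entry is nonzero. A motion map of this kind is one way temporal context could complement a purely semantic audio-visual similarity map when several plausible sources (e.g. parked vs. moving cars) share the same appearance.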