This paper studies audio-visual noise suppression for egocentric videos, where the speaker is not captured in the video. Instead, potential noise sources are visible on screen, with the camera emulating the off-screen speaker's view of the outside world. This setting differs from prior work on audio-visual speech enhancement, which relies on lip and facial visuals. In this paper, we first demonstrate that egocentric visual information is helpful for noise suppression. We compare visual feature extractors based on object recognition and action classification, and investigate methods to align audio and visual representations. We then examine different fusion strategies for the aligned features and the locations within the noise suppression model at which to incorporate visual information. Experiments demonstrate that visual features are most helpful when used to generate additive correction masks. Finally, to ensure that the visual features are discriminative with respect to different noise types, we introduce a multi-task learning framework that jointly optimizes audio-visual noise suppression and video-based acoustic event detection. The proposed multi-task framework outperforms the audio-only baseline on all metrics, including a 0.16 PESQ improvement. Extensive ablations show that the proposed model improves performance with multiple active distractors, across all noise types and SNRs.
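To make the two mechanisms named in the abstract concrete, the following is a minimal sketch, not the authors' implementation: visual features produce an additive correction to the audio-derived mask logits before the sigmoid, and an auxiliary video-based acoustic event head is trained jointly with the suppression objective. All module names, dimensions, and the weighting factor `aux_weight` are illustrative assumptions.

```python
# Hypothetical sketch of additive-mask fusion and multi-task training; the actual
# model architecture and loss weighting in the paper may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualSuppressor(nn.Module):
    def __init__(self, n_freq=257, audio_dim=256, visual_dim=512, n_event_classes=10):
        super().__init__()
        # Audio branch: maps noisy magnitude spectra to per-frame mask logits.
        self.audio_enc = nn.GRU(n_freq, audio_dim, batch_first=True)
        self.audio_mask = nn.Linear(audio_dim, n_freq)
        # Visual branch: maps temporally aligned frame embeddings to an additive
        # correction of the audio mask logits.
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        self.visual_mask = nn.Linear(audio_dim, n_freq)
        # Auxiliary head: acoustic event classification from visual features,
        # encouraging the visual representation to discriminate noise types.
        self.event_head = nn.Linear(audio_dim, n_event_classes)

    def forward(self, noisy_mag, visual_feats):
        # noisy_mag: (B, T, n_freq); visual_feats: (B, T, visual_dim), time-aligned.
        a, _ = self.audio_enc(noisy_mag)
        v = torch.relu(self.visual_proj(visual_feats))
        # Additive correction: visual logits are added to audio logits, then squashed.
        mask = torch.sigmoid(self.audio_mask(a) + self.visual_mask(v))
        enhanced = mask * noisy_mag
        event_logits = self.event_head(v.mean(dim=1))  # clip-level event prediction
        return enhanced, event_logits


def multitask_loss(enhanced, clean_mag, event_logits, event_labels, aux_weight=0.1):
    # Enhancement loss on magnitudes plus a weighted event-classification loss.
    return F.l1_loss(enhanced, clean_mag) + aux_weight * F.cross_entropy(event_logits, event_labels)
```

In this sketch the additive fusion only shifts the audio mask logits where the visual evidence warrants it, so the model degrades gracefully to audio-only behavior when the visual correction is near zero; the auxiliary event loss is what pushes the visual features to remain discriminative across noise types.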