We study the problem of localizing audio-visual events that are both audible and visible in a video. Existing works focus on encoding and aligning audio and visual features at the segment level while neglecting the informative correlation between segments of the two modalities and between multi-scale event proposals. We propose a novel MultiModulation Network (M2N) to learn the above correlation and leverage it as semantic guidance to modulate the related auditory, visual, and fused features. In particular, during feature encoding, we propose cross-modal normalization and intra-modal normalization. The former modulates the features of the two modalities by establishing and exploiting the cross-modal relationship. The latter modulates the features of a single modality with the event-relevant semantic guidance of the same modality. In the fusion stage, we propose a multi-scale proposal modulating module and a multi-alignment segment modulating module to introduce multi-scale event proposals and enable dense matching between cross-modal segments. With the auditory, visual, and fused features modulated by the correlation information regarding audio-visual events, M2N performs accurate event localization. Extensive experiments conducted on the AVE dataset demonstrate that our proposed method outperforms the state of the art in both supervised event localization and cross-modality localization.
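As a rough illustration of the cross-modal normalization idea described above (one modality's features are normalized and then modulated by guidance derived from the other modality), the following is a minimal sketch assuming a FiLM-style affine modulation; the class name CrossModalNorm, the tensor dimensions, and the exact formulation are illustrative assumptions and not the paper's actual implementation.

```python
# Hypothetical sketch: modulate per-segment features of one modality (e.g. visual)
# with scale/shift parameters predicted from the other modality (e.g. audio).
import torch
import torch.nn as nn


class CrossModalNorm(nn.Module):
    """Normalize target-modality features, then apply an affine modulation
    (gamma, beta) predicted from the guiding modality."""

    def __init__(self, feat_dim: int, guide_dim: int):
        super().__init__()
        # LayerNorm without its own affine parameters; the affine part
        # comes from the guiding modality instead.
        self.norm = nn.LayerNorm(feat_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(guide_dim, feat_dim)  # scale from guide
        self.to_beta = nn.Linear(guide_dim, feat_dim)   # shift from guide

    def forward(self, target: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # target: (batch, num_segments, feat_dim)
        # guide:  (batch, num_segments, guide_dim)
        gamma = self.to_gamma(guide)
        beta = self.to_beta(guide)
        return (1.0 + gamma) * self.norm(target) + beta


# Example usage: modulate visual segment features with audio guidance.
visual = torch.randn(2, 10, 512)   # 10 one-second segments per video
audio = torch.randn(2, 10, 128)
cmn = CrossModalNorm(feat_dim=512, guide_dim=128)
modulated_visual = cmn(visual, audio)
print(modulated_visual.shape)  # torch.Size([2, 10, 512])
```

The intra-modal normalization described in the abstract could be sketched analogously, with the guidance vector computed from event-relevant features of the same modality rather than the other one.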