Weakly-supervised audio-visual violence detection aims to identify snippets containing multimodal violence events using only video-level labels. Many prior works perform audio-visual integration and interaction in an early or intermediate manner, yet overlook modality heterogeneity in the weakly-supervised setting. In this paper, we analyze the modality asynchrony and undifferentiated instances phenomena of the multiple instance learning (MIL) procedure, and further investigate their negative impact on weakly-supervised audio-visual learning. To address these issues, we propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy. Specifically, we leverage a lightweight two-stream network to generate audio and visual bags, in which unimodal background, violent, and normal instances are clustered into semi-bags in an unsupervised way. Audio and visual violent semi-bag representations are then assembled as positive pairs, while violent semi-bags are paired with background and normal instances from the opposite modality as contrastive negative pairs. Furthermore, a self-distillation module transfers unimodal visual knowledge to the audio-visual model, which alleviates noise and closes the semantic gap between unimodal and multimodal features. Experiments show that our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset. Results also demonstrate that the proposed approach can be used as a plug-in module to enhance other networks. Code is available at https://github.com/JustinYuu/MACIL_SD.
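
The cross-modal pairing described above can be illustrated with a minimal sketch. It is not the released implementation: the abstract does not state the loss, so an InfoNCE-style objective with cosine similarity is assumed here, semi-bag representations are assumed to be mean-pooled instance embeddings, and the names `mean_pool`, `cosine`, `cross_modal_contrastive_loss`, and `temperature` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(instances: Sequence[Vector]) -> List[float]:
    """Average instance embeddings into one semi-bag representation (assumed pooling)."""
    n = len(instances)
    return [sum(column) / n for column in zip(*instances)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embeddings; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def cross_modal_contrastive_loss(
    anchor: Vector,
    positive: Vector,
    negatives: Sequence[Vector],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE-style loss (an assumption): pull the anchor toward the positive
    and push it away from the opposite-modality negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D embeddings, invented purely for illustration.
audio_violent = [[0.9, 0.1], [0.8, 0.2]]
visual_violent = [[1.0, 0.0], [0.9, 0.2]]
visual_other = [[0.0, 1.0], [0.1, 0.9]]  # background and normal visual instances

audio_semi_bag = mean_pool(audio_violent)
visual_semi_bag = mean_pool(visual_violent)

# Positive pair: audio violent semi-bag with visual violent semi-bag.
# Negatives: background and normal instances from the opposite (visual) modality.
loss_aligned = cross_modal_contrastive_loss(audio_semi_bag, visual_semi_bag, visual_other)

# Sanity check: swapping the roles of positive and negative should raise the loss.
loss_misaligned = cross_modal_contrastive_loss(
    audio_semi_bag, visual_other[0], [visual_semi_bag]
)
```

In this sketch the loss is small when the audio and visual violent semi-bags point in a similar direction and grows when the anchor is closer to background or normal instances than to its positive. A symmetric term with the visual semi-bag as anchor and audio instances as negatives could be added in the same way.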