Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in both auditory and visual modalities, fine-grained multimodal perception is essential for complete scene comprehension. Most previous works analyze videos from a holistic perspective; however, they do not consider semantic information at multiple temporal scales, which makes it difficult for the model to localize events of different lengths. In this paper, we present a Multimodal Pyramid Attentional Network (\textbf{MM-Pyramid}) for event localization. Specifically, we first propose an attentive feature pyramid module. This module captures temporal pyramid features via several stacking pyramid units, each of which is composed of a fixed-size attention block and a dilated convolution block. We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively. Extensive experiments on audio-visual event localization and weakly-supervised audio-visual video parsing tasks verify the effectiveness of our approach.
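To make the pyramid-unit design concrete, the following is a minimal PyTorch sketch of one stacking unit: a dilated temporal convolution paired with a fixed-size self-attention block, with units stacked at growing dilation rates to form a temporal feature pyramid. The class name \texttt{PyramidUnit}, all dimensions, and the residual/normalization arrangement are illustrative assumptions, not the paper's implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class PyramidUnit(nn.Module):
    # One pyramid unit: a dilated temporal convolution followed by a
    # fixed-size multi-head self-attention block. Hypothetical sketch;
    # names and dimensions are assumptions, not the paper's code.
    def __init__(self, dim=512, dilation=1, num_heads=4):
        super().__init__()
        # Dilated 1-D convolution enlarges the temporal receptive field
        # while preserving sequence length (padding == dilation for k=3).
        self.dilated_conv = nn.Conv1d(dim, dim, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (batch, time, dim)
        # Conv1d expects channel-first input, so transpose around it.
        h = self.dilated_conv(x.transpose(1, 2)).transpose(1, 2)
        # Self-attention over the sequence, with a residual connection.
        a, _ = self.attn(h, h, h)
        return self.norm(h + a)

# Stacking units with growing dilation yields a temporal feature pyramid.
pyramid = nn.Sequential(*[PyramidUnit(dim=512, dilation=2 ** i)
                          for i in range(3)])
feats = pyramid(torch.randn(2, 64, 512))  # (batch=2, T=64, dim=512)
\end{verbatim}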