Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in both auditory and visual modalities, fine-grained multimodal perception is essential for complete scene comprehension. Most previous works analyze videos from a holistic perspective. However, they do not consider semantic information at multiple temporal scales, which makes it difficult for the model to localize events of various lengths. In this paper, we present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures and integrates multi-level temporal features for audio-visual event localization and audio-visual video parsing. Specifically, we first propose an attentive feature pyramid module. This module captures temporal pyramid features via several stacked pyramid units, each of which is composed of a fixed-size attention block and a dilated convolution block. We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively. Extensive experiments on the audio-visual event localization and weakly-supervised audio-visual video parsing tasks verify the effectiveness of our approach.
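To make the pyramid unit concrete, below is a minimal PyTorch sketch of one stacked unit: a fixed-size (windowed) attention block followed by a dilated convolution block. The class name, residual/normalization placement, and all hyperparameters (window size, dilation schedule, head count) are our own illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class PyramidUnit(nn.Module):
    """One pyramid unit: a fixed-size (windowed) attention block followed by
    a dilated 1-D convolution block. Design details here are assumptions."""

    def __init__(self, dim: int, window: int = 4, dilation: int = 1, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        # "Same" padding keeps the temporal length fixed across stacked units.
        self.conv = nn.Conv1d(dim, dim, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); time is assumed divisible by the window size.
        b, t, d = x.shape
        # Fixed-size attention: self-attention restricted to local windows.
        w = x.reshape(b * t // self.window, self.window, d)
        w = w + self.attn(w, w, w, need_weights=False)[0]
        x = self.norm1(w.reshape(b, t, d))
        # Dilated convolution enlarges the temporal receptive field.
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x)


# Stacking units with growing dilation rates yields multi-level temporal features.
units = nn.ModuleList([PyramidUnit(dim=256, dilation=2 ** i) for i in range(4)])
x = torch.randn(2, 32, 256)          # toy audio or visual segment features
pyramid_feats = []
for unit in units:
    x = unit(x)
    pyramid_feats.append(x)          # one feature map per pyramid level
```

With a doubling dilation schedule, successive units cover progressively longer temporal spans, which is what lets the pyramid represent events of various lengths.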
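Likewise, a hedged sketch of the adaptive semantic fusion module, under the assumption that the unit-level attention block is a learned softmax weighting over pyramid levels and the selective fusion block is a sigmoid gate; the concrete design is ours, not the authors':

```python
import torch
import torch.nn as nn


class AdaptiveSemanticFusion(nn.Module):
    """Minimal sketch: unit-level attention scores each pyramid level, and a
    sigmoid gate acts as the selective fusion block. Details are assumptions."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)    # unit-level attention scores
        self.gate = nn.Linear(dim, dim)   # selective fusion gate

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: one (batch, time, dim) tensor per pyramid unit.
        x = torch.stack(feats, dim=1)                   # (batch, levels, time, dim)
        w = torch.softmax(self.score(x), dim=1)         # attention over levels
        fused = (w * x).sum(dim=1)                      # weighted sum of levels
        return torch.sigmoid(self.gate(fused)) * fused  # gated (selective) output


fuse = AdaptiveSemanticFusion(dim=256)
feats = [torch.randn(2, 32, 256) for _ in range(4)]  # toy pyramid outputs
out = fuse(feats)                                     # (2, 32, 256)
```

The intuition, in line with the abstract's motivation, is that the softmax over levels lets each time step emphasize the temporal scale that best matches its event length, integrating the pyramid features interactively rather than by simple averaging.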