Previous work on scene classification is mainly based on audio or visual signals, while humans perceive environmental scenes through multiple senses. Recent studies on audio-visual scene classification (AVSC) separately fine-tune large-scale audio and image pre-trained models on the target dataset, then either fuse the intermediate representations of the audio model and the visual model, or fuse the coarse-grained clip-level decisions of both models. Such methods ignore the detailed audio events and visual objects in audio-visual scenes (AVS), whereas humans often identify different scenes through the audio events and visual objects within them and the congruence between the two. To exploit the fine-grained information of audio events and visual objects in AVS, and to coordinate the implicit relationship between audio events and visual objects, this paper proposes a multi-branch model equipped with contrastive event-object alignment (CEOA) and semantic-based fusion (SF) for AVSC. CEOA aims to align the learned embeddings of audio events and visual objects by contrasting audio-visual event-object pairs. Visual objects associated with certain audio events, and vice versa, are then accentuated by cross-attention and undergo SF for semantic-level fusion. Experiments show that: 1) the proposed AVSC model equipped with CEOA and SF outperforms audio-only and visual-only models, i.e., the audio-visual results are better than those from a single modality; 2) CEOA aligns the embeddings of audio events and related visual objects at a fine-grained level, and SF effectively integrates both; 3) compared with other large-scale integrated systems, the proposed model shows competitive performance, even without using additional datasets or data-augmentation tricks.
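The abstract does not specify the form of the CEOA objective, but a contrastive alignment of paired embeddings is commonly realized as a symmetric InfoNCE-style loss: matched audio-event/visual-object pairs are pulled together while mismatched pairs in the batch are pushed apart. The sketch below is an illustrative assumption, not the authors' implementation; the function name, temperature value, and batch-pairing scheme are all hypothetical.

```python
import numpy as np

def contrastive_alignment_loss(audio_emb, visual_emb, temperature=0.07):
    """Illustrative symmetric InfoNCE-style loss over paired embeddings.

    audio_emb, visual_emb: arrays of shape (batch, dim); row i of each is
    assumed to be a matched audio-event / visual-object pair. Matched pairs
    (the diagonal of the similarity matrix) are treated as positives; all
    other in-batch pairs are negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(a))          # diagonal entries are the true pairs

    def cross_entropy(lg, lb):
        # numerically stable row-wise log-softmax, then negative log-likelihood
        shifted = lg - lg.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # symmetric: audio-to-visual retrieval plus visual-to-audio retrieval
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Under this formulation, well-aligned event-object embedding pairs yield a lower loss than shuffled (mismatched) pairs, which is the behavior CEOA relies on to bring related audio events and visual objects close in the shared embedding space.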