The use of multiple, semantically correlated sources can provide complementary information that may not be evident when working with individual modalities on their own. In this context, multi-modal models can help produce more accurate and robust predictions in machine learning tasks where audio-visual data is available. This paper presents a multi-modal model for automatic scene classification that simultaneously exploits auditory and visual information. The proposed approach makes use of two separate networks that are trained in isolation on audio and visual data, respectively, so that each network specializes in a given modality. The visual subnetwork is a pre-trained VGG16 model followed by a bidirectional recurrent layer, while the residual audio subnetwork is based on stacked squeeze-excitation convolutional blocks trained from scratch. After training each subnetwork, the fusion of information from the audio and visual streams is performed at two different stages. The early fusion stage combines features from the last convolutional block of the respective subnetworks at different time steps to feed a bidirectional recurrent structure. The late fusion stage combines the output of the early fusion stage with the independent predictions provided by the two subnetworks, resulting in the final prediction. We evaluate the method on the recently published TAU Audio-Visual Urban Scenes 2021 dataset, which contains synchronized audio and video recordings from 12 European cities covering 10 different scene classes. In the evaluation results of the DCASE 2021 Challenge, the proposed model has been shown to provide an excellent trade-off between prediction performance (86.5%) and system complexity (15M parameters).
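The two-stage fusion described above can be sketched in miniature as follows. This is an illustrative toy, not the paper's implementation: the feature sizes, weight matrices, and the simple time-averaging pooling are placeholders (the actual model uses a bidirectional recurrent structure for temporal aggregation, and the late-fusion combination shown here as a plain average is an assumption).

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def linear(vec, weights):
    # weights is a list of C rows, each of len(vec); returns C logits.
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def time_average(feats):
    # Pool a (T, F) list of feature vectors over time; a stand-in for
    # the paper's bidirectional recurrent aggregation.
    return [sum(col) / len(feats) for col in zip(*feats)]

random.seed(0)
T, F, C = 10, 16, 10  # time steps, per-modality feature size, scene classes

# Stand-ins for features from the last conv block of each subnetwork.
audio = [[random.gauss(0, 1) for _ in range(F)] for _ in range(T)]
visual = [[random.gauss(0, 1) for _ in range(F)] for _ in range(T)]

# Early fusion: concatenate the two modalities at each time step, then
# pool over time and classify.
fused = [a + v for a, v in zip(audio, visual)]  # (T, 2F)
W_early = [[random.gauss(0, 0.1) for _ in range(2 * F)] for _ in range(C)]
early_pred = softmax(linear(time_average(fused), W_early))

# Independent per-modality predictions from each subnetwork's own head.
W_a = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(C)]
W_v = [[random.gauss(0, 0.1) for _ in range(F)] for _ in range(C)]
audio_pred = softmax(linear(time_average(audio), W_a))
visual_pred = softmax(linear(time_average(visual), W_v))

# Late fusion: combine the early-fusion output with the two independent
# subnetwork predictions (shown here as a simple average) to obtain the
# final scene probabilities.
final_pred = [(e + a + v) / 3.0
              for e, a, v in zip(early_pred, audio_pred, visual_pred)]
scene_class = final_pred.index(max(final_pred))
```

Because each of the three streams outputs a probability distribution, their average is itself a valid distribution over the 10 scene classes; the predicted scene is its argmax.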