In this paper, we propose two techniques, namely joint modeling and data augmentation, to improve system performance for audio-visual scene classification (AVSC). We employ networks pre-trained only on image data sets to extract video embeddings, whereas the audio embedding models are trained from scratch. We explore different neural network architectures for joint modeling to effectively combine the video and audio modalities. Moreover, data augmentation strategies are investigated to increase the size of the audio-visual training set. For the video modality, the effectiveness of several operations in RandAugment is verified. An audio-video joint mixup scheme is proposed to further improve AVSC performance. Evaluated on the development set of TAU Urban Audio Visual Scenes 2021, our final system achieves the best accuracy of 94.2% among all single AVSC systems submitted to DCASE 2021 Task 1b.
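To make the joint mixup idea concrete, below is a minimal sketch of how such a scheme could operate, assuming the standard mixup formulation (a mixing coefficient lambda drawn from a Beta distribution) is applied with a single shared lambda to paired audio embeddings, video embeddings, and soft labels so the two modalities stay consistent after mixing. The function name, the `alpha` value, and the shared-lambda choice are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def joint_mixup(audio, video, labels, alpha=0.2, rng=None):
    """Sketch of an audio-video joint mixup over one mini-batch.

    A single lambda ~ Beta(alpha, alpha) is drawn and applied to both
    modalities and to the (one-hot) labels, so that a mixed audio clip
    and its mixed video frames correspond to the same label mixture.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    idx = rng.permutation(len(labels))  # random pairing within the batch
    mixed_audio = lam * audio + (1.0 - lam) * audio[idx]
    mixed_video = lam * video + (1.0 - lam) * video[idx]
    mixed_labels = lam * labels + (1.0 - lam) * labels[idx]  # soft labels
    return mixed_audio, mixed_video, mixed_labels
```

Using one lambda for both modalities (rather than mixing each independently) keeps the audio and video content of a mixed example aligned, which is the plausible motivation for a joint rather than per-modality mixup.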