Recently, audio-visual scene classification (AVSC) has attracted increasing attention from multidisciplinary communities. Previous studies tended to adopt a pipeline training strategy, which first uses well-trained visual and acoustic encoders to extract high-level representations (embeddings), and then utilizes them to train the audio-visual classifier. In this way, the extracted embeddings are well suited for uni-modal classifiers, but not necessarily for multi-modal ones. In this paper, we propose a joint training framework that directly uses acoustic features and raw images as inputs for the AVSC task. Specifically, we retrieve the bottom layers of pre-trained image models as the visual encoder, and jointly optimize the scene classifier and the 1D-CNN-based acoustic encoder during training. We evaluate the approach on the development dataset of TAU Urban Audio-Visual Scenes 2021. The experimental results show that our proposed approach achieves a significant improvement over the conventional pipeline training strategy. Moreover, our best single system outperforms previous state-of-the-art methods, yielding a log loss of 0.1517 and an accuracy of 94.59% on the official test fold.
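To make the joint training idea concrete, the following is a minimal sketch in PyTorch of how the bottom layers of a pre-trained image model can serve as the visual encoder while a 1D-CNN acoustic encoder and the scene classifier are optimized together end-to-end. The backbone choice (ResNet-50), layer cut point, feature dimensions, and class count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as tv_models


class AcousticEncoder1D(nn.Module):
    """Illustrative 1D-CNN acoustic encoder over frame-level acoustic features."""

    def __init__(self, in_dim=64, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, emb_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, x):                 # x: (batch, in_dim, time)
        return self.net(x).squeeze(-1)    # (batch, emb_dim)


class JointAVSCModel(nn.Module):
    """Joint model: pre-trained visual bottom layers + 1D-CNN acoustic encoder + classifier."""

    def __init__(self, num_classes=10, emb_dim=128):
        super().__init__()
        resnet = tv_models.resnet50(weights=None)  # load pre-trained weights in practice
        # Keep only the bottom layers of the pre-trained image model as the visual encoder
        # (cutting after layer2 is an assumption for this sketch).
        self.visual_encoder = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2,
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.acoustic_encoder = AcousticEncoder1D(emb_dim=emb_dim)
        self.classifier = nn.Linear(512 + emb_dim, num_classes)  # layer2 outputs 512 channels

    def forward(self, image, audio_feat):
        v = self.visual_encoder(image)          # (batch, 512)
        a = self.acoustic_encoder(audio_feat)   # (batch, emb_dim)
        return self.classifier(torch.cat([v, a], dim=1))


# The classifier and acoustic encoder are trained jointly with a standard
# cross-entropy loss; gradients may also flow into the visual bottom layers.
model = JointAVSCModel()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 64, 500))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1]))
loss.backward()
```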