This technical report describes the systems we submitted to subtask 1B of the DCASE 2021 challenge, which concerns audiovisual scene classification. The systems are multi-source transformers that combine auditory and visual features to make predictions. We evaluate the models using macro-averaged multi-class cross-entropy and accuracy. On the validation data, our best model achieves a macro-averaged multi-class cross-entropy of 0.620, slightly better (lower) than the baseline system's 0.658. Its accuracy on the validation data is 77.1\%, essentially the same as the baseline system's 77.0\%.
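As a rough illustration of the two metrics named above, the following sketch computes a macro-averaged multi-class cross-entropy and a class-wise averaged accuracy from predicted class probabilities. It is not the challenge's evaluation code. The function name `macro_avg_metrics`, the clipping constant `eps`, and the choice to average accuracy per class are assumptions made for this example.

```python
import math
from collections import defaultdict


def macro_avg_metrics(probs, labels, eps=1e-15):
    """Return (macro-averaged cross-entropy, macro-averaged accuracy).

    probs  : list of per-sample probability lists, one entry per class
    labels : list of integer ground-truth class indices
    Assumption: "macro-averaged" means each metric is first averaged
    within every class and then averaged across the classes present.
    """
    losses = defaultdict(list)
    hits = defaultdict(list)
    for p, y in zip(probs, labels):
        # Negative log-likelihood of the true class, clipped to avoid log(0).
        losses[y].append(-math.log(max(p[y], eps)))
        # Predicted class is the index with the highest probability.
        pred = max(range(len(p)), key=p.__getitem__)
        hits[y].append(1.0 if pred == y else 0.0)
    classes = sorted(losses)
    ce = sum(sum(losses[c]) / len(losses[c]) for c in classes) / len(classes)
    acc = sum(sum(hits[c]) / len(hits[c]) for c in classes) / len(classes)
    return ce, acc


# Toy example with two classes and three samples (made-up numbers).
toy_probs = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [0, 1, 1]
toy_ce, toy_acc = macro_avg_metrics(toy_probs, toy_labels)
```

Averaging within each class first means that a scene class with few validation clips contributes as much to the score as a class with many, which is why a macro-averaged value can differ from a plain per-sample average.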