We study the merit of transfer learning for two sound recognition problems, namely audio tagging and sound event detection. Employing feature fusion, we adapt a baseline system that uses only spectral acoustic inputs so that it can also exploit pretrained auditory and visual features, extracted from networks built for different tasks and trained with external data. We perform experiments with these modified models on an audiovisual multi-label data set whose training partition contains a large number of unlabeled samples and a smaller number of clips with weak annotations, indicating the clip-level presence of 10 sound categories without specifying the temporal boundaries of the active auditory events. For clip-based audio tagging, this transfer learning method yields marked improvements. Adding the visual modality on top of audio also proves advantageous in this context. When it comes to generating transcriptions of audio recordings, the benefit of pretrained features depends on the requested temporal resolution: for coarse-grained sound event detection, their utility remains notable, but when more fine-grained predictions are required, performance gains are strongly reduced due to a mismatch between the problem at hand and the goals of the models from which the pretrained vectors were obtained.
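To make the fusion strategy concrete, the sketch below illustrates one common way of combining per-frame spectral inputs with a clip-level pretrained embedding by concatenation before the classification layers. It is a minimal, assumption-laden illustration: the GRU-based encoder, the embedding dimension, and the mean-pooling from frame-level to clip-level predictions are hypothetical choices, not the actual architecture described in this work.

```python
import torch
import torch.nn as nn

class FusionTagger(nn.Module):
    """Minimal sketch of concatenation-based feature fusion.

    Per-frame spectral features are encoded by a recurrent layer and
    concatenated with a (repeated) clip-level pretrained embedding.
    Dimensions and layer choices are illustrative assumptions only.
    """

    def __init__(self, n_mels=64, emb_dim=1152, n_classes=10):
        super().__init__()
        # Hypothetical baseline encoder over spectral (log-mel) inputs.
        self.audio_encoder = nn.GRU(n_mels, 128, batch_first=True,
                                    bidirectional=True)
        # Classifier operates on fused frame-wise representations.
        self.classifier = nn.Linear(2 * 128 + emb_dim, n_classes)

    def forward(self, log_mel, pretrained_emb):
        # log_mel: (batch, frames, n_mels)
        # pretrained_emb: (batch, emb_dim), e.g. an external audio/visual embedding
        hidden, _ = self.audio_encoder(log_mel)             # (batch, frames, 256)
        emb = pretrained_emb.unsqueeze(1).expand(-1, hidden.size(1), -1)
        fused = torch.cat([hidden, emb], dim=-1)            # frame-wise fusion
        frame_probs = torch.sigmoid(self.classifier(fused)) # sound event detection
        clip_probs = frame_probs.mean(dim=1)                # clip-level audio tagging
        return frame_probs, clip_probs
```

In this kind of setup, frame-level outputs serve the sound event detection task while their pooled version serves clip-level tagging, which is one plausible reading of how a single fused model can address both problems.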