We study the usability of pre-trained weakly supervised audio tagging (AT) models as feature extractors for general audio representations. In particular, we analyze how well these embeddings transfer to other tasks within the speech and sound domains. Specifically, we benchmark weakly supervised pre-trained models (MobileNetV2 and EfficientNet-B0) against modern self-supervised learning methods (BYOL-A) as feature extractors. Fourteen downstream tasks, ranging from musical instrument classification to language classification, are used for evaluation. Our results indicate that AT pre-trained models are an excellent transfer-learning choice for music, event, and emotion recognition tasks. Further, fine-tuning AT models can also benefit speech-related tasks such as keyword spotting and intent classification.
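The transfer-learning setup described above can be sketched as follows: a frozen backbone (standing in for an AT-pretrained MobileNetV2 or EfficientNet-B0) maps a log-mel spectrogram to a clip-level embedding, and a lightweight linear probe is trained on top for each downstream task. This is a minimal illustrative sketch, not the paper's code; the placeholder CNN, the embedding dimension, and the mean-pooling choice are all assumptions.

```python
import torch
import torch.nn as nn

class ATFeatureExtractor(nn.Module):
    """Frozen audio backbone + mean pooling to a clip-level embedding.

    The backbone below is a small placeholder CNN; in the paper's setting
    it would be MobileNetV2 or EfficientNet-B0 pretrained on weakly
    labeled audio tagging data (an assumption about the exact pipeline).
    """

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Frozen feature extraction: backbone weights are not updated.
        for p in self.backbone.parameters():
            p.requires_grad = False

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        # log_mel: (batch, 1, n_mels, time_frames)
        feats = self.backbone(log_mel)   # (batch, embed_dim, f', t')
        return feats.mean(dim=(2, 3))    # pool to (batch, embed_dim)

def make_probe(embed_dim: int, n_classes: int) -> nn.Module:
    """Linear probe trained per downstream task on the frozen embedding."""
    return nn.Linear(embed_dim, n_classes)
```

For the fine-tuning variant reported for speech tasks, one would instead leave `requires_grad = True` on the backbone and train it jointly with the probe at a small learning rate.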