PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation

Audio tagging is an active research area and has a wide range of applications. Since the release of AudioSet, great progress has been made in advancing model performance, which mostly comes from the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio tagging models with AudioSet, but have not received the attention they deserve. To fill the gap, in this work, we present PSLA, a collection of training techniques that can noticeably boost the model accuracy including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, model aggregation and their design choices. By training an EfficientNet with these techniques, we obtain a single model (with 13.6M parameters) and an ensemble model that achieve mean average precision (mAP) scores of 0.444 and 0.474 on AudioSet, respectively, outperforming the previous best system of 0.439 with 81M parameters. In addition, our model also achieves a new state-of-the-art mAP of 0.567 on FSD50K.
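The results above are reported as mean average precision (mAP). As a rough illustration of what that number measures, here is a minimal sketch that assumes the common non-interpolated definition: per-class average precision is the mean of the precision values at the rank of each positive clip, and mAP is the unweighted mean over classes. The function names and the toy data are hypothetical, and this is not necessarily the exact evaluation code behind the reported scores.

```python
def average_precision(labels, scores):
    """AP for one class: mean of precision at the rank of each positive item."""
    # Rank items by descending score; sorted() is stable, so ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    # AP is undefined when the class has no positives.
    return sum(precisions) / hits if hits else float("nan")


def mean_average_precision(label_matrix, score_matrix):
    """mAP: unweighted mean of per-class AP, skipping classes with no positives."""
    num_classes = len(label_matrix[0])
    aps = []
    for c in range(num_classes):
        col_labels = [row[c] for row in label_matrix]
        col_scores = [row[c] for row in score_matrix]
        if any(col_labels):
            aps.append(average_precision(col_labels, col_scores))
    return sum(aps) / len(aps)


# Toy multi-label example (assumed data): 4 clips, 2 classes.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.8, 0.7], [0.4, 0.6], [0.1, 0.3]]
toy_map = mean_average_precision(labels, scores)
```

Because each class contributes equally regardless of how many clips carry it, rare classes weigh as much as frequent ones in the final score.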

