Audio tagging is an active research area with a wide range of applications. Since the release of AudioSet, model performance has improved greatly, mostly through the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio tagging models with AudioSet, yet they have not received the attention they deserve. To fill this gap, we present PSLA, a collection of training techniques that noticeably boost model accuracy: ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation, together with their design choices. By training an EfficientNet with these techniques, we obtain a single model (with 13.6M parameters) and an ensemble model that achieve mean average precision (mAP) scores of 0.444 and 0.474 on AudioSet, respectively, outperforming the previous best system, which reaches 0.439 with 81M parameters. In addition, our model achieves a new state-of-the-art mAP of 0.567 on FSD50K.
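For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean average precision (mAP) can be computed for a multi-label tagging task: average precision is computed per class from ranked scores and then averaged over classes. The function names `average_precision` and `mean_average_precision` and the toy data are illustrative assumptions, not the authors' evaluation code, and ties between scores are broken by input order here, which may differ from other implementations.

```python
import numpy as np

def average_precision(labels, scores):
    """AP for one class: mean of the precision measured at each positive's rank."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    if labels.sum() == 0:
        return float("nan")  # AP is undefined when a class has no positives
    order = np.argsort(-scores, kind="stable")  # rank clips by descending score
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

def mean_average_precision(label_matrix, score_matrix):
    """mAP: unweighted mean of per-class AP, skipping classes with no positives."""
    label_matrix = np.asarray(label_matrix)
    score_matrix = np.asarray(score_matrix)
    aps = [
        average_precision(label_matrix[:, c], score_matrix[:, c])
        for c in range(label_matrix.shape[1])
    ]
    return float(np.nanmean(aps))

# Toy example (hypothetical data): 4 clips, 2 classes.
toy_labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
toy_scores = np.array([[0.9, 0.2], [0.8, 0.6], [0.7, 0.9], [0.1, 0.3]])
toy_map = mean_average_precision(toy_labels, toy_scores)
```

In this toy case the first class has one negative ranked above a positive, so its AP falls below 1, while the second class is ranked perfectly; the mAP is the plain average of the two.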