Audio event classification is an active research area with a wide range of applications. Since the release of AudioSet, classification accuracy has improved substantially, mostly through novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio event classification models with AudioSet, yet they have not received the attention they deserve. To fill this gap, we present PSLA, a collection of training techniques that noticeably boost model accuracy: ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation, together with their design choices. By training an EfficientNet with these techniques, we obtain a model that achieves a new state-of-the-art mean average precision (mAP) of 0.474 on AudioSet, outperforming the previous best system's 0.439.
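For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean average precision is commonly computed for multi-label classification. It is not the authors' evaluation code. It assumes non-interpolated average precision per class, macro-averaged over classes, with ties in scores broken by input order. The function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one class: mean of precision@k over the ranks k that hold a positive."""
    # Rank clips from highest to lowest score (stable sort, so ties keep input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("class has no positive examples")
    return sum(precisions) / len(precisions)


def mean_average_precision(
    label_matrix: Sequence[Sequence[int]],
    score_matrix: Sequence[Sequence[float]],
) -> float:
    """Macro-average of per-class AP; rows are clips, columns are classes.

    Assumption: classes with no positive clip are skipped rather than scored as 0.
    """
    num_classes = len(label_matrix[0])
    aps: List[float] = []
    for c in range(num_classes):
        col_labels = [row[c] for row in label_matrix]
        if sum(col_labels) == 0:
            continue
        col_scores = [row[c] for row in score_matrix]
        aps.append(average_precision(col_labels, col_scores))
    return sum(aps) / len(aps)


# Hypothetical toy data: 4 clips, 2 event classes (multi-label).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.1]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

In the toy data, the second class ranks both of its positives first, while the first class has one negative clip ranked above a positive, which pulls its average precision below 1.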