A well-designed joint training framework can improve weakly supervised audio tagging (AT) and acoustic event detection (AED) simultaneously. In this study, we propose three methods to improve the best teacher-student framework from the IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 4 on both the AT and AED tasks. First, we propose a frame-level, target-event-based deep feature distillation, which leverages the limited strongly labeled data within the weakly supervised framework to learn better intermediate feature maps. Second, we propose an adaptive focal loss and a two-stage training strategy for more effective and accurate model training, in which the contributions of hard and easy acoustic events to the total cost function are adjusted automatically. Third, we design an event-specific post-processing to improve the predicted time-stamps of target events. Experiments on the public DCASE 2019 Task 4 dataset show that our approach achieves competitive performance in both the AT (81.2\% F1-score) and AED (49.8\% F1-score) tasks.
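To make the role of hard and easy events in the cost function concrete, the following is a minimal sketch of the standard binary focal loss with a fixed focusing parameter. It is not the adaptive focal loss proposed in this work, whose adjustment rule is not given in the abstract; the function name `binary_focal_loss` and the parameters `gamma` and `eps` are illustrative assumptions.

```python
import math


def binary_focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean standard binary focal loss (fixed gamma, assumed for illustration).

    probs   -- predicted event probabilities in [0, 1]
    targets -- 0/1 labels, same length as probs
    gamma   -- focusing parameter; gamma = 0 reduces to plain cross-entropy
    """
    if len(probs) == 0 or len(probs) != len(targets):
        raise ValueError("probs and targets must be non-empty and equal length")
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)       # clamp to keep log() finite
        p_t = p if y == 1 else 1.0 - p        # probability assigned to the true class
        # (1 - p_t)^gamma down-weights easy examples (p_t close to 1)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


# An easy positive (p = 0.9) contributes far less than a hard one (p = 0.1).
easy = binary_focal_loss([0.9], [1])
hard = binary_focal_loss([0.1], [1])
print(easy < hard)
```

With `gamma = 0` the weighting factor is 1 and the loss equals binary cross-entropy; larger values shift more of the total cost onto hard events. An adaptive variant would replace the fixed `gamma` with a value adjusted during training.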