In this report, we present low-complexity deep learning frameworks for acoustic scene classification (ASC). The proposed frameworks comprise four main steps: front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities. In particular, we first transform audio recordings into Mel, Gammatone, and CQT spectrograms. Next, the data augmentation methods of Random Cropping, SpecAugment, and Mixup are applied to generate augmented spectrograms before they are fed into deep-learning-based classifiers. Finally, to achieve the best performance, we fuse the probabilities obtained from three individual classifiers, each independently trained on one of the three spectrogram types. Our experiments on the DCASE 2022 Task 1 Development dataset fulfil the low-complexity requirement and achieve a best classification accuracy of 60.1%, improving on the DCASE baseline by 17.2%.
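The late-fusion step can be sketched as follows; this is a minimal illustration assuming simple mean fusion of the per-classifier probability matrices (the report may use a different fusion rule), with hypothetical toy probabilities standing in for the three spectrogram-specific models:

```python
import numpy as np

def late_fusion(prob_list):
    """Fuse class probabilities from independently trained classifiers
    by averaging, then return the predicted class per sample.

    prob_list: list of (n_samples, n_classes) arrays, one per classifier
    (e.g. Mel-, Gammatone-, and CQT-based models).
    """
    fused = np.mean(np.stack(prob_list, axis=0), axis=0)
    return fused.argmax(axis=1)

# Toy example: three classifiers, two samples, three scene classes.
p_mel = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_gam = np.array([[0.5, 0.4, 0.1], [0.1, 0.6, 0.3]])
p_cqt = np.array([[0.7, 0.2, 0.1], [0.3, 0.3, 0.4]])
print(late_fusion([p_mel, p_gam, p_cqt]))  # [0 1]
```

Mean fusion requires no extra trainable parameters, which keeps the ensemble within the low-complexity constraints of the task.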