In this paper, we present a low-complexity deep learning framework for acoustic scene classification (ASC). The proposed framework can be separated into three main steps: front-end spectrogram extraction, back-end classification, and late fusion of predicted probabilities. First, we use a Mel filter bank, a Gammatone filter bank, and the Constant-Q Transform (CQT) to transform the raw audio signal into spectrograms, in which both frequency and temporal features are represented. The three spectrograms are then fed into three individual back-end convolutional neural networks (CNNs), which classify them into ten urban scene classes. Finally, a late fusion of the predicted probabilities obtained from the three CNNs is conducted to achieve the final classification result. To reduce the complexity of our proposed CNN network, we apply two model compression techniques: model restriction and decomposed convolution. Our extensive experiments, conducted on the DCASE 2021 (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1A development dataset, achieve a low-complexity CNN-based framework with 128 KB of trainable parameters and a best classification accuracy of 66.7%, improving the DCASE baseline by 19.0%.
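The following is a minimal sketch of the three-branch pipeline summarized above, not the authors' exact configuration: spectrogram parameters (sampling rate, hop length, number of bins), the use of librosa, and the mean-fusion rule are illustrative assumptions; the Gammatone branch is only indicated in a comment since it requires a separate filter-bank implementation.

```python
# Illustrative sketch (assumed parameters, not the paper's exact setup):
# extract two of the three spectrogram types and fuse per-branch probabilities.
import numpy as np
import librosa

def extract_spectrograms(path, sr=32000, n_fft=2048, hop=1024):
    """Return log-Mel and log-CQT spectrograms for one audio clip.
    A Gammatone spectrogram would be computed analogously with a
    Gammatone filter bank (e.g., from an external package)."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=128)
    cqt = np.abs(librosa.cqt(y=y, sr=sr, hop_length=hop, n_bins=96))
    return librosa.power_to_db(mel), librosa.amplitude_to_db(cqt)

def late_fusion(prob_mel, prob_gam, prob_cqt):
    """Late fusion of per-class probabilities from the three CNN branches,
    here by simple averaging (the fusion rule is an assumption)."""
    probs = np.stack([prob_mel, prob_gam, prob_cqt], axis=0)  # (3, n_classes)
    fused = probs.mean(axis=0)
    return int(np.argmax(fused))  # index of the predicted scene class
```

In use, each spectrogram would be passed to its own trained CNN to obtain a probability vector over the ten scene classes, and `late_fusion` would combine the three vectors into the final prediction.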