This technical report describes our submission to Task 1A of the DCASE 2021 challenge. The goal of the task is to design an audio scene classification system for device-imbalanced datasets under model-complexity constraints. This report introduces four methods to achieve this goal. First, we propose Residual Normalization, a novel feature normalization method that uses instance normalization with a shortcut path to discard unnecessary device-specific information without losing information useful for classification. Second, we design an efficient architecture, BC-ResNet-Mod, a modified version of the baseline architecture with a limited receptive field. Third, we exploit spectrogram-to-spectrogram translation from one device to multiple devices to augment the training data. Finally, we use three model compression schemes (pruning, quantization, and knowledge distillation) to reduce model complexity. The proposed system achieves an average test accuracy of 76.3% on the TAU Urban Acoustic Scenes 2020 Mobile development dataset with 315k parameters, and an average test accuracy of 75.3% after compression to 61.0KB of non-zero parameters. This work is extended in [1].
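To make the idea of "instance normalization with a shortcut path" concrete, the following is a minimal NumPy sketch, not the submitted implementation. It assumes a feature map of shape (batch, channels, frequency, time), assumes the normalization statistics are computed per sample and per frequency bin, and assumes the shortcut is an identity path scaled by a hypothetical weight `lam`; the function name `residual_norm` and all parameter values are illustrative only.

```python
import numpy as np


def residual_norm(x, lam=0.1, eps=1e-5):
    """Sketch of normalization with a shortcut path (assumed form).

    x   : array of shape (batch, channels, freq, time)
    lam : hypothetical weight of the identity shortcut
    eps : small constant for numerical stability
    """
    # Assumption: statistics are taken per sample and per frequency bin,
    # i.e. over the channel and time axes.
    mean = x.mean(axis=(1, 3), keepdims=True)
    var = x.var(axis=(1, 3), keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)
    # The shortcut path adds back a scaled copy of the input, so the
    # un-normalized information is not discarded entirely.
    return lam * x + normed


rng = np.random.default_rng(0)
spec = rng.normal(size=(2, 1, 8, 16))  # toy log-mel-like features

# Toy "device" effect: a per-frequency gain and offset (illustrative only).
gain = rng.uniform(0.5, 2.0, size=(1, 1, 8, 1))
offset = rng.normal(size=(1, 1, 8, 1))
colored = gain * spec + offset

out = residual_norm(spec, lam=0.1)
# Largest gap between clean and "device-colored" features after the
# normalized branch alone (lam=0); it should be close to zero.
gap = np.abs(residual_norm(spec, lam=0.0) - residual_norm(colored, lam=0.0)).max()
```

In this toy setting the normalized branch alone (`lam=0.0`) cancels the per-frequency gain and offset, while a non-zero `lam` keeps part of the original signal available to the classifier. How the balance between the two paths is chosen in the actual system is not stated in the abstract.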