This technical report describes our submission to TASK1A of the DCASE2021 challenge. The goal of the task is to design an audio scene classification system for device-imbalanced datasets under model complexity constraints. The report introduces four methods to achieve this goal. First, we propose Residual Normalization, a novel feature normalization method that uses instance normalization with a shortcut path to discard unnecessary device-specific information without losing information useful for classification. Second, we design an efficient architecture, BC-ResNet-Mod, a modified version of the baseline architecture with a limited receptive field. Third, we exploit spectrogram-to-spectrogram translation from one device to multiple devices to augment the training data. Finally, we use three model compression schemes (pruning, quantization, and knowledge distillation) to reduce model complexity. The proposed system achieves an average test accuracy of 76.3% on the TAU Urban Acoustic Scenes 2020 Mobile development dataset with 315k parameters, and an average test accuracy of 75.3% after compression to 61.0 KB of non-zero parameters.
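To make the idea of "instance normalization with a shortcut path" concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give the formula, so several details here are assumptions: the function name `residual_norm`, the shortcut weight `lam`, the choice to compute statistics per sample and per frequency bin (over the channel and time axes), and the input layout `(batch, channels, freq, time)`.

```python
import numpy as np


def residual_norm(x, lam=0.1, eps=1e-5):
    """Instance normalization plus a weighted identity shortcut (illustrative).

    x   : array of shape (batch, channels, freq, time); the layout is an assumption.
    lam : weight of the shortcut path; hypothetical parameter, not from the abstract.
    eps : small constant that keeps the division numerically stable.
    """
    # Statistics per sample and per frequency bin, taken over channels and time.
    # Normalizing along frequency is an assumption made for this sketch.
    mean = x.mean(axis=(1, 3), keepdims=True)
    var = x.var(axis=(1, 3), keepdims=True)

    # Normalized path: removes per-bin offset and scale, which is where
    # device-specific coloration would be expected to show up.
    normalized = (x - mean) / np.sqrt(var + eps)

    # Shortcut path: adds back a scaled copy of the input so that information
    # removed by the normalization is not lost entirely.
    return lam * x + normalized
```

With `lam=0.0` the function reduces to plain instance normalization, whose output is unchanged when the input is rescaled or shifted by a constant; a non-zero `lam` trades some of that invariance for retaining part of the original signal, which matches the stated aim of discarding device-specific information without losing information useful for classification.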