Handling audio from multiple recording devices with a single, efficiently designed acoustic scene classification system is a practical research problem. In this work, we propose Residual Normalization, a novel feature normalization method that uses frequency-wise instance normalization with a shortcut path to discard unnecessary device-specific information without losing information useful for classification. Moreover, we introduce an efficient architecture, BC-ResNet-ASC, a modified version of the baseline architecture with a limited receptive field. BC-ResNet-ASC outperforms the baseline architecture even though it has a small number of parameters. Through three model compression schemes (pruning, quantization, and knowledge distillation), we further reduce model complexity while mitigating performance degradation. The proposed system achieves an average test accuracy of 76.3% on the TAU Urban Acoustic Scenes 2020 Mobile development dataset with 315k parameters, and an average test accuracy of 75.3% after compression to 61.0 KB of non-zero parameters. The proposed method won 1st place in the DCASE 2021 Challenge, Task 1A.
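The idea of frequency-wise normalization with a shortcut path can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the code below rests on assumptions: the input is a single-channel spectrogram stored as a list of frequency rows, statistics are computed per frequency bin over time, and the shortcut is a scaled identity added to the normalized features. The names `residual_norm`, `lam`, and `eps` are hypothetical and chosen only for illustration.

```python
import math


def residual_norm(spec, lam=1.0, eps=1e-5):
    """Sketch of frequency-wise normalization plus a shortcut path.

    spec: list of F rows (one per frequency bin), each a list of T time frames.
    lam:  assumed weight of the identity (shortcut) path; lam=0 gives the
          purely normalized features.
    eps:  small constant for numerical stability.
    """
    out = []
    for row in spec:
        # Statistics are computed separately for each frequency bin.
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        std = math.sqrt(var + eps)
        # Shortcut (lam * v) plus the normalized value.
        out.append([lam * v + (v - mean) / std for v in row])
    return out


# Toy example: device B sees the same scene as device A, but with a
# different gain and offset in each frequency bin (a crude stand-in for
# a device-specific frequency response).
device_a = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.1, 0.9, 0.3]]
device_b = [
    [2.0 * v + 1.0 for v in device_a[0]],
    [0.5 * v - 2.0 for v in device_a[1]],
]

# With lam=0 the per-frequency gain and offset are removed almost entirely.
norm_a = residual_norm(device_a, lam=0.0)
norm_b = residual_norm(device_b, lam=0.0)

# With lam>0 the shortcut keeps part of the original signal in the output.
res_a = residual_norm(device_a, lam=1.0)
```

In this toy setting the normalized path makes the two devices nearly indistinguishable, while the shortcut path passes the unnormalized features through, which is one way to read the stated goal of discarding device-specific information without losing information useful for classification.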