Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches are designed to reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size with complex methods, eliminating the temporal redundancy in the input audio features (e.g., the mel-spectrogram) can be an effective approach to efficient audio classification. To this end, we propose a family of simple pooling front-ends (SimPFs) that use simple non-parametric pooling operations to reduce the redundant information within the mel-spectrogram. We perform extensive experiments on four audio classification tasks to evaluate the performance of SimPFs. Experimental results show that SimPFs can reduce the number of floating point operations (FLOPs) of off-the-shelf audio neural networks by more than half, with negligible degradation or even some improvement in audio classification performance.
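To make the idea concrete, below is a minimal sketch (not the authors' released code) of a non-parametric pooling front-end: the time axis of the mel-spectrogram is shrunk by average pooling before the backbone network, so every downstream layer processes fewer frames. The pooling factor of 2, the function name, and the mel settings are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of a simple pooling front-end (assumptions noted above).
import torch
import torch.nn.functional as F

def simple_pooling_frontend(mel_spec: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Average-pool a mel-spectrogram of shape (batch, n_mels, n_frames)
    along the time axis, reducing n_frames by `factor`."""
    return F.avg_pool1d(mel_spec, kernel_size=factor, stride=factor)

# Usage: a batch of 4 spectrograms with 64 mel bins and 1000 time frames.
mel = torch.randn(4, 64, 1000)
pooled = simple_pooling_frontend(mel, factor=2)
print(pooled.shape)  # torch.Size([4, 64, 500]) -> roughly half the downstream FLOPs
```

Because the pooling is non-parametric, it adds no trainable weights; a max-pooling or uniform subsampling variant could be substituted in the same place.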