Recently, there has been increasing interest in building efficient audio neural networks for on-device scenarios. Most existing approaches reduce the size of audio neural networks using methods such as model pruning. In this work, we show that instead of reducing model size with complex methods, eliminating the temporal redundancy in the input audio features (e.g., the Mel-spectrogram) can be an effective approach for efficient audio classification. To this end, we propose a family of simple pooling front-ends (SimPFs) that use simple non-parametric pooling operations to reduce the redundant information within the Mel-spectrogram. We perform extensive experiments on four audio classification tasks to evaluate the performance of SimPFs. Experimental results show that SimPFs can reduce the FLOPs of off-the-shelf audio neural networks by more than half, with negligible degradation or even decent improvement in audio classification performance.
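To illustrate the core idea, the following is a minimal sketch of a non-parametric pooling front-end that compresses the time axis of a Mel-spectrogram before it reaches the classifier. This is a hypothetical illustration assuming average pooling with a compression factor of 2; the helper name `simple_pool` and the shapes are our own, not the authors' implementation.

```python
import numpy as np

def simple_pool(mel_spec: np.ndarray, factor: int = 2) -> np.ndarray:
    """Average-pool a Mel-spectrogram along the time axis.

    mel_spec: array of shape (n_mels, n_frames).
    factor:   temporal compression factor; adjacent frames are averaged.
    Returns an array of shape (n_mels, n_frames // factor).
    """
    n_mels, n_frames = mel_spec.shape
    # Drop trailing frames that do not fill a complete pooling window.
    trimmed = n_frames - (n_frames % factor)
    windows = mel_spec[:, :trimmed].reshape(n_mels, trimmed // factor, factor)
    return windows.mean(axis=2)

# Example: a 64-bin Mel-spectrogram with 1000 frames shrinks to 500 frames,
# roughly halving the FLOPs of any downstream convolutional classifier.
mel = np.random.rand(64, 1000).astype(np.float32)
pooled = simple_pool(mel, factor=2)
print(pooled.shape)  # (64, 500)
```

Because the pooling has no learnable parameters, it can be prepended to a pretrained audio network without retraining, which is what makes this kind of front-end attractive for off-the-shelf models.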