Data-Free Knowledge Distillation (DFKD) has recently attracted growing attention in the academic community, especially following major breakthroughs in computer vision. Despite these promising results, the technique has not been widely applied to audio and signal processing. Because audio signals vary in duration, audio modeling requires its own distinct approach. In this work, we propose Feature-Rich Audio Model Inversion (FRAMI), a data-free knowledge distillation framework for general sound classification tasks. FRAMI first generates high-quality, feature-rich Mel-spectrograms through a feature-invariant contrastive loss. It then reuses the hidden states before and after the statistics pooling layer when performing knowledge distillation on these feature-rich samples. Experimental results on the UrbanSound8K, ESC-50, and AudioMNIST datasets demonstrate that FRAMI generates feature-rich samples. Moreover, reusing the hidden states further improves the accuracy of the student model, which significantly outperforms the baseline methods.
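To illustrate the hidden-state reuse described above, the following PyTorch sketch distills both the frame-level hidden states (before statistics pooling) and the pooled utterance-level statistics (after pooling), alongside a standard temperature-scaled logit KD term. The function names, the MSE-based matching, and the weighting factors `alpha` and `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def statistics_pooling(frames):
    """Pool variable-length frame features into a fixed-size vector.

    frames: (batch, time, dim) frame-level hidden states.
    Returns (batch, 2*dim): concatenated mean and standard deviation.
    """
    mean = frames.mean(dim=1)
    std = frames.std(dim=1)
    return torch.cat([mean, std], dim=-1)


def distillation_loss(student_frames, teacher_frames,
                      student_logits, teacher_logits,
                      temperature=4.0, alpha=0.5, beta=0.5):
    """KD loss that reuses hidden states before and after statistics pooling.

    The hyperparameters and MSE matching terms are assumptions for
    illustration; they do not reproduce FRAMI's exact losses.
    """
    # Logit-level KD: KL divergence with temperature scaling.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Frame-level (pre-pooling) hidden-state matching.
    pre_pool = F.mse_loss(student_frames, teacher_frames)

    # Utterance-level (post-pooling) embedding matching.
    post_pool = F.mse_loss(statistics_pooling(student_frames),
                           statistics_pooling(teacher_frames))

    return kd + alpha * pre_pool + beta * post_pool
```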