Improving the performance of on-device audio classification models remains a challenge given the computational limits of the mobile environment. Many studies leverage knowledge distillation to boost predictive performance by transferring the knowledge of large models to on-device models. However, most approaches either fail to transfer the temporal information that is crucial to audio classification tasks, or require the student to share a similar architecture with the teacher. In this paper, we propose a new knowledge distillation method designed to transfer the temporal knowledge embedded in the attention weights of large models to on-device models. Our method is applicable to various architectures, including non-attention-based ones such as CNNs and RNNs, without any architectural change at inference time. Through extensive experiments on both an audio event detection dataset and a noisy keyword spotting dataset, we show that our method improves predictive performance across diverse on-device architectures.
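The abstract does not specify the distillation objective, so the following is only an illustrative sketch of one plausible instantiation. It assumes a transformer teacher that exposes self-attention weights of shape (batch, heads, T, T) over time frames, and a hypothetical `SmallCNNStudent` with an auxiliary `temporal_head` (both names invented here) that produces a distribution over time frames during training; the teacher's average per-frame attention is matched to this distribution via KL divergence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNNStudent(nn.Module):
    """Compact on-device CNN. The temporal_head is an auxiliary module used
    only to compute the distillation loss during training (an assumption of
    this sketch, not a detail confirmed by the abstract)."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Collapse the mel-frequency axis, keep the time axis.
        self.pool_freq = nn.AdaptiveAvgPool2d((1, None))
        self.classifier = nn.Linear(64, n_classes)
        self.temporal_head = nn.Linear(64, 1)  # dropped at inference

    def forward(self, x: torch.Tensor):
        # x: log-mel spectrogram, (batch, 1, mels, T)
        h = self.conv(x)                       # (B, 64, mels, T)
        h = self.pool_freq(h).squeeze(2)       # (B, 64, T)
        logits = self.classifier(h.mean(-1))   # time-pooled features -> classes
        # Student's soft distribution over time frames, aligned with the teacher.
        attn = self.temporal_head(h.transpose(1, 2)).squeeze(-1)  # (B, T)
        return logits, F.log_softmax(attn, dim=-1)

def temporal_kd_loss(student_log_attn: torch.Tensor,
                     teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between the teacher's frame-importance distribution
    (attention averaged over heads and query positions; assumes the teacher
    and student operate on the same number of frames T) and the student's."""
    t_importance = teacher_attn.mean(dim=(1, 2))                     # (B, T)
    t_importance = t_importance / t_importance.sum(-1, keepdim=True)
    return F.kl_div(student_log_attn, t_importance, reduction="batchmean")
```

In training, this auxiliary term would be added to the usual objective, e.g. `loss = F.cross_entropy(logits, y) + beta * temporal_kd_loss(log_attn, teacher_attn)` with an illustrative weight `beta`. Because `temporal_head` is consulted only by the distillation loss, it can be discarded after training, which is consistent with the abstract's claim that no architectural change is required at inference.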