Improving the performance of on-device audio classification models remains a challenge, given the computational limits of the mobile environment. Many studies leverage knowledge distillation to boost predictive performance by transferring knowledge from large models to on-device models. However, most lack a mechanism to distill the essence of temporal information, which is crucial to audio classification tasks, or they require the teacher and student to share similar architectures. In this paper, we propose a new knowledge distillation method designed to incorporate the temporal knowledge embedded in the attention weights of large transformer-based models into on-device models. Our distillation method is applicable to various types of architectures, including non-attention-based architectures such as CNNs or RNNs, while retaining the original network architecture during inference. Through extensive experiments on both an audio event detection dataset and a noisy keyword spotting dataset, we show that our proposed method improves the predictive performance across diverse on-device architectures.
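To make the idea concrete, here is a minimal PyTorch sketch of one way such attention-based temporal distillation could be set up. It is an illustrative assumption, not the paper's actual implementation: the class name `TemporalDistillLoss`, the auxiliary projection head, the head/query averaging, and the temperature are all hypothetical choices. The sketch derives a per-frame importance distribution from the teacher's self-attention weights and matches it, via a KL-divergence loss, against a temporal distribution predicted from the student's intermediate features; the auxiliary head is used only during training, which is consistent with keeping the student's inference-time architecture unchanged.

```python
# Minimal sketch of attention-based temporal distillation (assumed design,
# not the paper's exact method). A transformer teacher's attention weights
# supervise a temporal distribution predicted from a CNN/RNN student.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDistillLoss(nn.Module):
    """KL divergence between the teacher's per-frame attention mass and a
    temporal distribution derived from the student's features.

    The 1-dim projection is an auxiliary head used only during training,
    so the student's inference-time architecture is unchanged.
    """

    def __init__(self, student_dim: int, temperature: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(student_dim, 1)  # auxiliary head, dropped at inference
        self.temperature = temperature

    def forward(self, teacher_attn: torch.Tensor, student_feats: torch.Tensor):
        # teacher_attn: (B, H, T, T) self-attention weights from the teacher.
        # Average over heads and query positions to get per-frame importance.
        teacher_dist = teacher_attn.mean(dim=(1, 2))                    # (B, T)
        teacher_dist = teacher_dist / teacher_dist.sum(-1, keepdim=True)

        # student_feats: (B, T', D). Project to per-frame logits, then
        # interpolate along time to match the teacher's T frames.
        logits = self.proj(student_feats).squeeze(-1)                   # (B, T')
        logits = F.interpolate(
            logits.unsqueeze(1), size=teacher_dist.size(-1),
            mode="linear", align_corners=False,
        ).squeeze(1)                                                    # (B, T)
        student_log_dist = F.log_softmax(logits / self.temperature, dim=-1)

        return F.kl_div(student_log_dist, teacher_dist, reduction="batchmean")
```

In a training loop, this term would typically be added to the usual classification loss with a weighting coefficient, e.g. `loss = ce_loss + lam * distill_loss(teacher_attn, student_feats)`, where `lam` is a hyperparameter.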