Feature representations derived from models pre-trained on large-scale datasets have been shown to generalize well across a variety of audio analysis tasks. Despite this generalizability, however, task-specific features can outperform them when sufficient training data is available, since task-relevant properties can be learned directly. Furthermore, the complexity of pre-trained models imposes a considerable computational burden during inference. We propose to leverage both detailed task-specific features from the spectrogram input and generic pre-trained features by introducing two regularization methods that integrate the information of both feature classes. The inference workload remains low because the pre-trained features are only required during training. In experiments with the pre-trained features VGGish, OpenL3, and a combination of both, we show that the proposed methods not only outperform baseline methods but can also improve state-of-the-art models on several audio classification tasks. The results also suggest that using a mixture of pre-trained features performs better than using either feature set individually.
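The abstract does not spell out the two regularization methods. Below is a minimal sketch of the general idea of training-time-only integration of pre-trained features, assuming a hypothetical feature-alignment regularizer: a small spectrogram classifier is trained with a standard classification loss plus an MSE term that pulls a projected embedding toward pre-computed pre-trained features. The model architecture, projection head, loss weighting, and 128-dimensional embedding size are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class TaskModel(nn.Module):
    """Small spectrogram classifier; the projection head maps its embedding
    into the dimensionality of the frozen pre-trained features and is only
    used during training (illustrative architecture, not the paper's)."""
    def __init__(self, n_classes=10, embed_dim=128, pretrained_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(embed_dim, n_classes)
        self.projection = nn.Linear(embed_dim, pretrained_dim)

    def forward(self, spec):
        z = self.encoder(spec)
        return self.classifier(z), self.projection(z)

def training_step(model, spec, labels, pretrained_feats, alpha=0.1):
    """Classification loss plus a feature-alignment regularizer.
    `pretrained_feats` (e.g. pre-computed VGGish/OpenL3 embeddings) are
    needed only here, not at inference time."""
    logits, proj = model(spec)
    cls_loss = nn.functional.cross_entropy(logits, labels)
    reg_loss = nn.functional.mse_loss(proj, pretrained_feats)
    return cls_loss + alpha * reg_loss

# Toy usage with random tensors standing in for a real batch.
model = TaskModel()
spec = torch.randn(8, 1, 64, 100)      # batch of log-mel spectrograms
labels = torch.randint(0, 10, (8,))
pretrained = torch.randn(8, 128)       # pre-computed pre-trained features
loss = training_step(model, spec, labels, pretrained)
loss.backward()
```

At inference time only the encoder and classifier are evaluated, so neither VGGish nor OpenL3 has to be run, which is what keeps the inference workload low.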