Most machine learning models for audio tasks rely on a handcrafted feature, the spectrogram. However, it remains unknown whether the spectrogram can be replaced with features learned by deep neural networks. In this paper, we answer this question by comparing different learnable neural feature extractors against a successful spectrogram-based model, and propose a General Audio Feature eXtractor (GAFX) built on dual U-Net (GAFX-U), ResNet (GAFX-R), and Attention (GAFX-A) modules. We design experiments to evaluate this model on the music genre classification task using the GTZAN dataset and perform a detailed ablation study over different configurations of our framework. Our model GAFX-U, followed by an Audio Spectrogram Transformer (AST) classifier, achieves competitive performance.