The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets can capture a diverse set of features and learn a representation that generalizes to unseen tasks and datasets from the same domain. Hence, these models can serve as powerful feature extractors, in combination with shallow models as classifiers, for smaller tasks and datasets where the amount of training data is insufficient for learning an end-to-end model from scratch. In recent years, Convolutional Neural Networks (CNNs) have largely been the method of choice for audio processing. Recently, however, attention-based transformer models have demonstrated great potential in supervised settings, outperforming CNNs. In this work, we investigate the use of audio transformers trained on large-scale datasets to learn general-purpose representations. We study how different setups of these audio transformers affect the quality of their embeddings. We experiment with the models' time resolution, extracted embedding level, and receptive fields to see how they affect performance on a variety of tasks and datasets, following the HEAR 2021 NeurIPS challenge evaluation setup. Our results show that representations extracted by audio transformers outperform CNN representations. Furthermore, we show that transformers trained on AudioSet can be extremely effective representation extractors for a wide range of downstream tasks.
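To make the setup concrete, below is a minimal sketch of the pipeline the abstract describes: a pretrained audio transformer used as a frozen feature extractor, with a shallow classifier trained on the extracted embeddings. `PretrainedAudioTransformer` is a hypothetical stand-in for any AudioSet-pretrained model that maps raw waveforms to fixed-size embeddings; it is not an API from this paper, and the dummy projection inside it only serves to make the example runnable.

```python
# Sketch: frozen transformer embeddings + shallow downstream classifier.
# Assumption: `PretrainedAudioTransformer` is a hypothetical placeholder,
# not the paper's actual model or any real library API.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression


class PretrainedAudioTransformer(torch.nn.Module):
    """Placeholder: in practice, load an AudioSet-pretrained transformer here."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Dummy stand-in: a real model would compute a spectrogram, run
        # transformer blocks, and pool token embeddings.
        self.proj = torch.nn.Linear(16000, embed_dim)

    @torch.no_grad()
    def embed(self, waveforms: torch.Tensor) -> torch.Tensor:
        # Map raw waveforms to fixed-size embeddings (weights stay frozen).
        return self.proj(waveforms)


model = PretrainedAudioTransformer().eval()

# Toy downstream dataset: 1-second clips at 16 kHz with binary labels.
waveforms = torch.randn(64, 16000)
labels = np.random.randint(0, 2, size=64)

# Extract embeddings once, then fit a shallow classifier on top of them.
embeddings = model.embed(waveforms).numpy()
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print("train accuracy:", clf.score(embeddings, labels))
```

Because the transformer is never fine-tuned, the embeddings can be computed once per clip and reused across downstream tasks, which is what makes this setup attractive when training data is scarce.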