The success of supervised deep learning methods is largely due to their ability to learn relevant features from raw data. Deep Neural Networks (DNNs) trained on large-scale datasets are capable of capturing a diverse set of features and learning a representation that can generalize to unseen tasks and datasets from the same domain. Hence, these models can be used as powerful feature extractors, in combination with shallower models as classifiers, for smaller tasks and datasets where the amount of training data is insufficient for learning an end-to-end model from scratch. In recent years, Convolutional Neural Networks (CNNs) have largely been the method of choice for audio processing. Recently, however, attention-based transformer models have demonstrated great potential in supervised settings, outperforming CNNs. In this work, we investigate the use of audio transformers trained on large-scale datasets to learn general-purpose representations. We study how different setups in these audio transformers affect the quality of their embeddings. We experiment with the models' time resolution, extracted embedding level, and receptive fields in order to see how they affect performance on a variety of tasks and datasets, following the HEAR 2021 NeurIPS challenge evaluation setup. Our results show that representations extracted by audio transformers outperform CNN representations. Furthermore, we show that transformers trained on AudioSet can be extremely effective representation extractors for a wide range of downstream tasks.
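The pipeline described above — extracting clip-level embeddings from a pretrained audio transformer and feeding them to a shallow classifier — can be sketched as follows. This is a minimal illustration, not the paper's actual model: the `AudioEmbedder` class, its dimensions, and the mean-pooling choice are assumptions for demonstration, and the encoder here is randomly initialized, whereas in practice one would load AudioSet-pretrained weights.

```python
import torch
import torch.nn as nn

class AudioEmbedder(nn.Module):
    """Hypothetical stand-in for a pretrained audio transformer.

    Projects log-mel spectrogram frames to token embeddings, runs a
    transformer encoder over the time axis, and mean-pools over time to
    produce one fixed-size embedding per clip (the "extracted embedding"
    a shallow downstream classifier would consume).
    """

    def __init__(self, n_mels=128, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)  # frame-level projection
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, spec):
        # spec: (batch, time, n_mels) log-mel spectrogram
        tokens = self.encoder(self.proj(spec))  # (batch, time, d_model)
        # Mean-pool over time for a clip-level embedding; a frame-level
        # task would instead keep the per-timestep tokens.
        return tokens.mean(dim=1)               # (batch, d_model)

model = AudioEmbedder().eval()
with torch.no_grad():
    # Fake batch: 8 clips, 100 spectrogram frames, 128 mel bins.
    spec = torch.randn(8, 100, 128)
    emb = model(spec)
print(emb.shape)  # one 256-dim embedding per clip
```

The resulting embeddings would then be used to train a shallow classifier (e.g. logistic regression or a small MLP) on each downstream task, with the transformer's weights kept frozen, as in the HEAR 2021 evaluation protocol.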