Transformers have revolutionized the world of deep learning, especially in the field of natural language processing. Recently, the Audio Spectrogram Transformer (AST) was proposed for audio classification, leading to state-of-the-art results on several datasets. However, in order for ASTs to outperform CNNs, pretraining with ImageNet is needed. In this paper, we study one component of the AST, the positional encoding, and propose several variants to improve the performance of ASTs trained from scratch, without ImageNet pretraining. Our best model, which incorporates conditional positional encodings, significantly improves performance on AudioSet and ESC-50 compared to the original AST.
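To make the key ingredient concrete, the following is a minimal sketch of a conditional positional encoding: instead of a fixed learned table, the position signal is generated from the tokens themselves by a depthwise 3x3 convolution over the 2D patch grid and added back residually. The function name, the numpy formulation, and the explicit loop are illustrative assumptions, not the paper's implementation (which would use a trainable convolution layer in a deep learning framework).

```python
import numpy as np

def conditional_positional_encoding(tokens, grid_h, grid_w, kernels):
    """Sketch of a conditional positional encoding (hypothetical helper).

    tokens:  (n, d) patch embeddings, with n == grid_h * grid_w
    kernels: (d, 3, 3) one 3x3 kernel per channel (depthwise conv weights)

    The position signal is computed FROM the tokens (a depthwise 3x3
    convolution over the patch grid, zero-padded) and added residually,
    so it adapts to the input length instead of being a fixed table.
    """
    n, d = tokens.shape
    assert n == grid_h * grid_w and kernels.shape == (d, 3, 3)
    grid = tokens.reshape(grid_h, grid_w, d)
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)))  # zero-pad the grid borders
    out = np.zeros_like(grid)
    for i in range(grid_h):
        for j in range(grid_w):
            patch = padded[i:i + 3, j:j + 3, :]          # (3, 3, d) neighborhood
            # depthwise: each channel is convolved with its own 3x3 kernel
            out[i, j] = np.einsum('hwc,chw->c', patch, kernels)
    return tokens + out.reshape(n, d)
```

Because the encoding is computed from the token grid rather than indexed from a fixed-size table, it naturally handles spectrograms of varying length, which is one motivation for conditional encodings over absolute ones.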