In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.
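To make the idea of a convolution-free, purely attention-based audio classifier concrete, below is a minimal sketch in PyTorch. It splits a mel spectrogram into flattened patches, projects them with a linear layer (no convolution), and classifies via a [CLS] token through a standard Transformer encoder. All hyperparameters here (patch size, embedding dimension, depth, input shape) are illustrative assumptions, not the paper's actual configuration; details such as patch overlap and ImageNet pretraining are omitted.

```python
# Minimal sketch of a convolution-free, attention-based spectrogram
# classifier in the spirit of AST. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class SpectrogramTransformer(nn.Module):
    def __init__(self, n_mels=128, n_frames=1024, patch=16,
                 dim=768, depth=12, heads=12, n_classes=527):
        super().__init__()
        n_patches = (n_mels // patch) * (n_frames // patch)
        # Linear projection of flattened patches -- no convolution anywhere.
        self.patch_embed = nn.Linear(patch * patch, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)
        self.patch = patch

    def forward(self, spec):  # spec: (batch, n_mels, n_frames)
        b, _, _ = spec.shape
        p = self.patch
        # Cut the spectrogram into non-overlapping p x p patches and flatten.
        x = spec.unfold(1, p, p).unfold(2, p, p)   # (b, m/p, t/p, p, p)
        x = x.reshape(b, -1, p * p)                # (b, n_patches, p*p)
        x = self.patch_embed(x)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                  # classify via [CLS] token

model = SpectrogramTransformer()
logits = model(torch.randn(2, 128, 1024))          # -> (2, 527)
```

The 527 output classes match AudioSet's multi-label setting (apply a sigmoid per class for mAP evaluation); for single-label benchmarks like ESC-50 or Speech Commands V2, the head size and a softmax would be swapped in.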