Attention-based Transformers have been increasingly applied to audio classification because of their global receptive field and their ability to handle long-term dependencies. However, existing frameworks, which are mainly extended from Vision Transformers, are not fully compatible with audio signals. In this paper, we introduce the Causal Audio Transformer (CAT), which combines Multi-Resolution Multi-Feature (MRMF) feature extraction with an acoustic attention block for audio modeling better suited to such signals. In addition, we propose a causal module that alleviates over-fitting, helps with knowledge transfer, and improves interpretability. CAT achieves classification performance higher than or comparable to the state of the art on the ESC50, AudioSet, and UrbanSound8K datasets, and can be easily generalized to other Transformer-based models.