Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest, over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On a standard dataset, Free Sound 50K, comprising 200 categories, our model outperforms convolutional models and produces state-of-the-art results. This is significant because, unlike in natural language processing and computer vision, we do not perform unsupervised pre-training to outperform convolutional architectures. On the same training set, we show a significant improvement with respect to mean average precision benchmarks. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show how our model learns a non-linear, non-constant bandwidth filter-bank, which provides an adaptable time-frequency front-end representation for the task of audio understanding, different from other tasks, e.g., pitch estimation.
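To make the overall idea concrete, the following is a minimal sketch (not the authors' code) assuming a PyTorch environment: raw audio is split into fixed-size patches, each patch is mapped to an embedding by a learned linear front end (which can learn a filter-bank-like representation), and a stack of Transformer encoder layers with sequence pooling between stages produces clip-level logits for 200 classes. The class name, patch size, stage layout, and the placement of pooling are illustrative assumptions, and the wavelet-inspired multi-rate processing of embeddings is omitted.

```python
# Minimal sketch of a convolution-free audio Transformer classifier.
# All hyperparameters here are illustrative, not the paper's settings.
import torch
import torch.nn as nn


class AudioTransformerSketch(nn.Module):
    def __init__(self, patch_size=2048, d_model=256, n_heads=4,
                 layers_per_stage=(2, 2, 2), num_classes=200):
        super().__init__()
        self.patch_size = patch_size
        # Learned linear front end applied to raw audio patches (no convolutions).
        self.front_end = nn.Linear(patch_size, d_model)
        # Stages of Transformer encoder layers; the token sequence is
        # average-pooled by a factor of 2 between stages, loosely mirroring
        # pooling in convolutional networks.
        self.stages = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True),
                num_layers=n)
            for n in layers_per_stage
        ])
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, audio):
        # audio: (batch, samples); truncate to a whole number of patches.
        b, t = audio.shape
        t = (t // self.patch_size) * self.patch_size
        patches = audio[:, :t].reshape(b, -1, self.patch_size)
        x = self.front_end(patches)               # (batch, seq, d_model)
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i < len(self.stages) - 1 and x.size(1) >= 2:
                # Pool along the sequence dimension between stages.
                x = self.pool(x.transpose(1, 2)).transpose(1, 2)
        return self.classifier(x.mean(dim=1))     # clip-level logits


if __name__ == "__main__":
    model = AudioTransformerSketch()
    clips = torch.randn(2, 2048 * 16)             # two mock raw-audio clips
    print(model(clips).shape)                     # torch.Size([2, 200])
```

In this reading, the rows of the front-end linear layer play the role of the learned filter-bank, since each output dimension is a learned projection of a raw audio patch rather than a fixed spectrogram bin.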