Transformers have seen an unprecedented rise in Natural Language Processing and Computer Vision tasks. However, in audio tasks, they are either infeasible to train due to the extremely long sequence lengths of raw audio waveforms or incur a performance penalty when trained on Fourier-based features. In this work, we introduce an architecture, Audiomer, which combines 1D Residual Networks with Performer attention to achieve state-of-the-art performance in keyword spotting on raw audio waveforms, outperforming all previous methods while being computationally cheaper and more parameter-efficient. Additionally, our model has practical advantages for speech processing, such as inference on arbitrarily long audio clips owing to the absence of positional encoding. The code is available at https://github.com/The-Learning-Machines/Audiomer-PyTorch.
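As a rough illustration of the idea described above (not the authors' implementation), the sketch below pairs a 1D residual convolution front-end with a linear-complexity attention layer that operates directly on raw waveforms and uses no positional encoding. The ELU+1 kernelized attention is a simple stand-in for Performer's FAVOR+ mechanism, and all module names, layer sizes, and strides are illustrative assumptions rather than the published configuration.

```python
# Minimal sketch of a conv-plus-linear-attention model on raw audio.
# NOT the Audiomer implementation; hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResBlock1d(nn.Module):
    """1D residual convolution block that downsamples the waveform."""
    def __init__(self, in_ch, out_ch, stride=8):
        super().__init__()
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size=9, stride=stride, padding=4)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size=9, padding=4)
        self.skip = nn.Conv1d(in_ch, out_ch, kernel_size=1, stride=stride)
        self.norm = nn.BatchNorm1d(out_ch)

    def forward(self, x):                      # x: (batch, channels, time)
        h = F.relu(self.conv1(x))
        h = self.conv2(h)
        return F.relu(self.norm(h + self.skip(x)))


class LinearAttention(nn.Module):
    """Kernelized attention with O(T) cost, a stand-in for Performer attention."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, time, dim)
        b, t, d = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.heads, -1).transpose(1, 2) for z in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1                       # positive feature map
        kv = torch.einsum("bhtd,bhte->bhde", k, v)              # sum_t phi(k_t) v_t^T
        z = 1 / (torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhtd,bhde,bht->bhte", q, kv, z)
        return self.out(out.transpose(1, 2).reshape(b, t, d))


class TinyWaveformClassifier(nn.Module):
    """Conv front-end shortens the raw waveform before attention; no positional encoding,
    so arbitrarily long clips are handled by the mean pool over time."""
    def __init__(self, n_classes=35, dim=64):
        super().__init__()
        self.encoder = nn.Sequential(ResBlock1d(1, dim), ResBlock1d(dim, dim))
        self.attn = LinearAttention(dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, wav):                    # wav: (batch, 1, samples), raw audio
        h = self.encoder(wav).transpose(1, 2)  # -> (batch, time, dim)
        h = h + self.attn(h)                   # residual attention block
        return self.head(h.mean(dim=1))        # pool over time


if __name__ == "__main__":
    model = TinyWaveformClassifier()
    logits = model(torch.randn(2, 1, 16000))   # one second of 16 kHz audio
    print(logits.shape)                        # torch.Size([2, 35])
```

The convolutional stack reduces the sequence length before attention is applied, which is one way such a hybrid can stay tractable on raw waveforms; the actual Audiomer design should be taken from the linked repository.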