The Transformer architecture has been successful across many domains, including natural language processing, computer vision, and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent, and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12- and 35-command tasks, respectively.
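As a rough illustration of what a fully self-attentional keyword spotter can look like, the sketch below treats each MFCC time frame of a one-second clip as an input token, prepends a learnable class token, and classifies from the encoder's output at that token. The tokenization scheme, embedding dimension, depth, and head count here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class KeywordTransformerSketch(nn.Module):
    """Minimal sketch of a fully self-attentional keyword spotter.

    Assumes the MFCC spectrogram is split along the time axis so that
    each time frame becomes one token; all hyperparameters below are
    placeholders chosen for illustration only.
    """

    def __init__(self, n_mfcc=40, n_frames=98, dim=192, depth=12,
                 heads=3, num_classes=35):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, dim)                  # one time frame -> one token embedding
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_emb = nn.Parameter(torch.zeros(1, n_frames + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, mfcc):
        # mfcc: (batch, n_frames, n_mfcc)
        x = self.proj(mfcc)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_emb       # prepend class token, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                           # classify from the class token

# Example: a batch of 8 clips, each 98 frames x 40 MFCC coefficients
logits = KeywordTransformerSketch()(torch.randn(8, 98, 40))
print(logits.shape)  # torch.Size([8, 35])
```

Because the model consumes a standard (frames x features) spectrogram and emits per-keyword logits, it can slot into an existing keyword-spotting pipeline in place of a convolutional or recurrent classifier, which is what the drop-in-replacement claim above refers to.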