SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax attention (SA). In addition, because its convolution modules rely on a single large kernel, SqueezeFormer's local modeling ability is limited. In this paper, we propose HybridFormer, a novel method that improves SqueezeFormer in a fast and efficient way. Specifically, we first incorporate linear attention (LA) and propose a hybrid LASA paradigm to increase the model's inference speed. Second, we propose a hybrid neural architecture search (NAS)-guided structural re-parameterization (SRep) mechanism, termed NSR, to enhance the model's ability to extract local interactions. Extensive experiments conducted on the LibriSpeech dataset demonstrate that our proposed HybridFormer achieves a 9.1% relative word error rate (WER) reduction over SqueezeFormer on the test-other set. Furthermore, for 30 s input speech, HybridFormer speeds up inference by up to 18%. Our source code is available online.
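To illustrate why linear attention avoids the quadratic cost of softmax attention, the following is a minimal sketch of a generic kernelized linear-attention step, assuming PyTorch, batch-first tensors, and the elu+1 feature map of Katharopoulos et al. (2020); the exact LA variant used in HybridFormer's LASA layers may differ.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention: O(n * d^2) instead of O(n^2 * d).

    q, k: (batch, seq, d_k); v: (batch, seq, d_v).
    """
    # Positive feature map phi(x) = elu(x) + 1 (Katharopoulos et al., 2020).
    q, k = F.elu(q) + 1.0, F.elu(k) + 1.0
    # Associativity trick: contract phi(K)^T with V first, so no
    # (seq x seq) attention matrix is ever materialized.
    kv = torch.einsum("bld,blm->bdm", k, v)             # (batch, d_k, d_v)
    norm = torch.einsum("bld,bd->bl", q, k.sum(dim=1))  # row normalizer
    return torch.einsum("bld,bdm->blm", q, kv) / (norm.unsqueeze(-1) + eps)
```

Note that the output is not numerically identical to softmax attention; the gain is the linear-in-sequence-length complexity, which is what drives the reported speedup on long (e.g., 30 s) inputs.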
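As context for the SRep mechanism, below is a minimal, hypothetical sketch of how structural re-parameterization merges parallel depthwise-convolution branches into a single kernel at inference (RepVGG-style). The NAS-guided branch search that distinguishes NSR is omitted; the class name and kernel sizes are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepDWConv1d(nn.Module):
    """Depthwise 1-D conv trained with parallel branches of different
    (odd) kernel sizes, merged into one equivalent kernel for inference."""

    def __init__(self, channels, big_k=31, small_k=3):
        super().__init__()
        self.big = nn.Conv1d(channels, channels, big_k,
                             padding=big_k // 2, groups=channels)
        self.small = nn.Conv1d(channels, channels, small_k,
                               padding=small_k // 2, groups=channels)
        self.merged = None

    def forward(self, x):                       # x: (batch, channels, time)
        if self.merged is not None:
            return self.merged(x)               # inference: single conv
        return self.big(x) + self.small(x)      # training: parallel branches

    @torch.no_grad()
    def reparameterize(self):
        # Convolution is linear, so zero-padding the small kernel to the
        # big kernel's length and summing the weights yields one conv
        # whose output equals the sum of the two branches.
        kb, ks = self.big.weight, self.small.weight
        pad = (kb.shape[-1] - ks.shape[-1]) // 2
        merged = nn.Conv1d(kb.shape[0], kb.shape[0], kb.shape[-1],
                           padding=kb.shape[-1] // 2,
                           groups=kb.shape[0]).to(kb.device)
        merged.weight.copy_(kb + F.pad(ks, (pad, pad)))
        merged.bias.copy_(self.big.bias + self.small.bias)
        self.merged = merged
```

Because the merged convolution is numerically equivalent to the multi-branch form, accuracy is unchanged while the per-branch inference overhead disappears.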