The Convolution-augmented Transformer (Conformer) has recently shown promising results in Automatic Speech Recognition (ASR), outperforming the previous best published Transformer Transducer. In this work, we hypothesize that the output of each block in the encoder and decoder does not capture all relevant information on its own; in other words, the outputs of different blocks may be complementary. We study how to exploit this complementary information in a parameter-efficient way, with the expectation that doing so leads to more robust performance. We therefore propose the Block-augmented Transformer for speech recognition, named Blockformer. We implement two block ensemble methods: the base Weighted Sum of the Blocks Output (Base-WSBO), and the Squeeze-and-Excitation module to Weighted Sum of the Blocks Output (SE-WSBO). Experiments show that Blockformer significantly outperforms state-of-the-art Conformer-based models on AISHELL-1: our model achieves a CER of 4.35\% without a language model and 4.10\% with an external language model on the test set.
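
To make the two block ensemble ideas concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify how the block weights are normalized or how the Squeeze-and-Excitation gate is wired, so the softmax normalization, the mean-pooling squeeze, the two-layer ReLU/sigmoid gate, and all names (`base_wsbo`, `se_wsbo`, `w1`, `w2`) are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def base_wsbo(block_outputs, logits):
    """Weighted sum of block outputs with one learnable scalar per block.

    block_outputs: array of shape (N, T, D), one (T, D) output per block.
    logits: array of shape (N,); softmax normalization is an assumption.
    """
    weights = softmax(logits)                              # (N,)
    return np.tensordot(weights, block_outputs, axes=1)    # (T, D)


def se_wsbo(block_outputs, w1, w2):
    """Weighted sum whose weights come from an SE-style gate (assumed layout).

    Squeeze: average each block output to a single scalar.
    Excitation: two linear maps with ReLU, then sigmoid, giving N gates.
    w1: (H, N) and w2: (N, H) are hypothetical gate parameters.
    """
    squeezed = block_outputs.mean(axis=(1, 2))             # (N,)
    hidden = np.maximum(0.0, w1 @ squeezed)                # (H,)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))           # (N,)
    return np.tensordot(gates, block_outputs, axes=1)      # (T, D)


# Toy usage: 12 blocks, 50 frames, 256-dimensional features.
rng = np.random.default_rng(0)
blocks = rng.standard_normal((12, 50, 256))

# With all-zero logits the softmax is uniform, so Base-WSBO is a plain average.
base_out = base_wsbo(blocks, np.zeros(12))

# Random gate parameters with a bottleneck of 4 hidden units.
se_out = se_wsbo(blocks, rng.standard_normal((4, 12)), rng.standard_normal((12, 4)))
```

In this sketch Base-WSBO adds only one parameter per block, while the SE variant makes the weights depend on the input at the cost of two small matrices; both keep the output shape equal to that of a single block.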