The recently proposed Conformer model has become the de facto backbone for various downstream speech tasks, owing to its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After reexamining the design choices for both the macro- and micro-architecture of Conformer, we propose the Squeezeformer model, which consistently outperforms state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure, which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure in which a multi-head attention or convolution module is followed by a feed-forward module, instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise downsampling layer to sub-sample the input signal. Squeezeformer achieves state-of-the-art word-error-rates of 7.5%, 6.5%, and 6.0% on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.
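To make the two macro-architecture ideas concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: (a) a depthwise-separable temporal downsampling layer, and (b) a Temporal U-Net wrapper that runs the encoder blocks at half the frame rate, so that self-attention, whose cost grows quadratically with sequence length, becomes roughly 4x cheaper, before upsampling and adding a skip connection. The class names, kernel size, and the stand-in MLP blocks are illustrative assumptions.

```python
import torch
import torch.nn as nn


class DepthwiseDownsample(nn.Module):
    """Halve the sequence length with a depthwise + pointwise (separable)
    1D convolution, which is far cheaper than a dense strided convolution."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv1d(
            dim, dim, kernel_size, stride=2,
            padding=kernel_size // 2, groups=dim,  # groups=dim => depthwise
        )
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> (batch, ~time/2, dim)
        x = x.transpose(1, 2)  # Conv1d expects (batch, dim, time)
        x = self.pointwise(self.depthwise(x))
        return x.transpose(1, 2)


class TemporalUNet(nn.Module):
    """Downsample, run the encoder blocks on the shorter sequence,
    upsample, and add a skip from the full-rate features."""

    def __init__(self, dim: int, blocks: nn.ModuleList):
        super().__init__()
        self.downsample = DepthwiseDownsample(dim)
        self.blocks = blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip = x                           # full temporal resolution
        x = self.downsample(x)             # (B, T, D) -> (B, T', D)
        for block in self.blocks:
            x = block(x)                   # attention on the shorter sequence
        x = x.repeat_interleave(2, dim=1)  # naive 2x temporal upsampling
        x = x[:, : skip.size(1)]           # trim in case T was odd
        return x + skip                    # U-Net style skip connection


# Toy usage: the blocks here are stand-in MLPs; in Squeezeformer each block
# is (multi-head attention -> feed-forward) or (convolution -> feed-forward).
dim = 16
blocks = nn.ModuleList(
    nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
    for _ in range(2)
)
unet = TemporalUNet(dim, blocks)
out = unet(torch.randn(2, 100, dim))
print(out.shape)  # torch.Size([2, 100, 16])
```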