Conformer has achieved impressive results in Automatic Speech Recognition (ASR) by leveraging the Transformer's ability to capture content-based global interactions and the convolutional neural network's ability to exploit local features. In Conformer, two macaron-like feed-forward layers with half-step residual connections sandwich the multi-head self-attention and convolution modules, followed by a post layer normalization. We improve Conformer's long-sequence representation ability in two directions, making it \emph{sparser} and \emph{deeper}. We adapt a sparse self-attention mechanism with $\mathcal{O}(L\log L)$ time complexity and memory usage. A deep normalization strategy is applied to the residual connections to ensure stable training of encoders with on the order of one hundred Conformer blocks. On the Japanese CSJ-500h dataset, this deep sparse Conformer achieves CERs of 5.52\%, 4.03\%, and 4.50\% on the three evaluation sets, and 4.16\%, 2.84\%, and 3.20\% when ensembling five deep sparse Conformer variants with 12, 16, 17, 50, and 100 encoder layers, respectively.
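As a rough illustration of the deep normalization strategy, each residual connection can be written as an up-weighted residual followed by layer normalization; the depth-dependent constant $\alpha$ shown below (e.g., $\alpha=(2N)^{1/4}$ for an $N$-block encoder, as in DeepNorm) is stated here as an assumption about the exact scaling rather than the paper's reported choice:
\begin{equation*}
x_{l+1} = \mathrm{LayerNorm}\bigl(\alpha\, x_l + \mathcal{F}_l(x_l)\bigr),
\end{equation*}
where $\mathcal{F}_l$ denotes the $l$-th sub-layer (feed-forward, multi-head self-attention, or convolution module). Scaling the residual branch by $\alpha>1$ bounds the magnitude of parameter updates early in training, which is what permits stacking on the order of one hundred blocks without divergence.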