The Transformer architecture, built on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be easily applied to streaming or online ASR. For self-attention in Transformer ASR, the softmax-normalized attention mechanism spreads probability mass over all frames and therefore cannot sharply highlight the important speech information. For multi-head attention in Transformer ASR, it is not easy to model monotonic alignments across the different heads. To overcome these two limitations, we integrate sparse attention and monotonic attention into Transformer-based ASR. The sparse mechanism introduces a learned sparsity scheme that enables each self-attention structure to better fit its corresponding head. The monotonic attention deploys regularization to prune redundant heads from the multi-head attention structure. Experiments show that our method effectively improves the attention mechanism on widely used speech recognition benchmarks.
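To illustrate how a sparse normalizer differs from softmax in the attention step, the sketch below replaces softmax with sparsemax, one concrete sparse alternative that can assign exactly zero weight to irrelevant frames. This is a minimal PyTorch-style sketch under that assumption; the choice of sparsemax, the function names, and the tensor shapes are illustrative and not taken from the paper's actual learned-sparsity scheme or implementation.

```python
import torch


def sparsemax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Sparsemax: Euclidean projection of the scores onto the probability
    simplex, producing attention weights that are exactly zero for
    low-scoring positions (unlike softmax, which is always dense)."""
    z_sorted, _ = torch.sort(scores, dim=dim, descending=True)
    k = torch.arange(1, scores.size(dim) + 1,
                     device=scores.device, dtype=scores.dtype)
    view = [1] * scores.dim()
    view[dim] = -1
    k = k.view(view)
    z_cumsum = z_sorted.cumsum(dim)
    # Support set: sorted positions that remain positive after thresholding.
    support = 1 + k * z_sorted > z_cumsum
    k_z = support.to(scores.dtype).sum(dim=dim, keepdim=True)
    tau = (torch.where(support, z_sorted, torch.zeros_like(z_sorted))
           .sum(dim=dim, keepdim=True) - 1) / k_z
    return torch.clamp(scores - tau, min=0.0)


def sparse_scaled_dot_attention(q, k, v):
    """Scaled dot-product attention with sparsemax in place of softmax.
    q, k, v: (batch, heads, frames, d_k)."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    weights = sparsemax(scores, dim=-1)  # exact zeros on irrelevant frames
    return torch.matmul(weights, v), weights


if __name__ == "__main__":
    q = torch.randn(2, 4, 10, 64)  # (batch, heads, frames, d_k)
    k = torch.randn(2, 4, 10, 64)
    v = torch.randn(2, 4, 10, 64)
    out, w = sparse_scaled_dot_attention(q, k, v)
    # A noticeable fraction of the attention weights are exactly zero.
    print(out.shape, (w == 0).float().mean().item())
```

A learned sparsity scheme in the sense of the abstract would go one step further, e.g. by letting each head adapt how sparse its distribution is; the monotonic regularization used for head pruning is not sketched here.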