Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel, which makes inference relatively slow. We believe that because the encoder already captures the whole speech utterance, which implicitly contains the token-level relationships, we can predict each token without explicit autoregressive language modeling. When the prediction of a token does not rely on other tokens, parallel prediction of all tokens in the sequence becomes realizable. Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS); all three modules are built from basic attention blocks. The encoder extracts high-level representations from the speech. The PDS uses positional encodings corresponding to token positions to convert the acoustic representations into token-level representations. The decoder further captures token-level relationships with the self-attention mechanism. Finally, a probability distribution over the vocabulary is computed for each token position. Speech recognition is thereby re-formulated as a position-wise classification problem. Furthermore, we propose a cross-modal transfer learning method that refines semantics from the large-scale pre-trained language model BERT to improve performance.
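The core idea above — positional queries cross-attending over acoustic representations, followed by a position-wise classifier — can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's implementation: all shapes, the random initialization, and the single-head simplification are assumptions (the actual LASO modules are stacked multi-head attention blocks), but it shows how every token slot is filled in parallel with no autoregressive dependency.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pds(encoder_out, pos_queries):
    """Position-dependent summarizer (single-head sketch):
    positional encodings act as queries that attend over the
    acoustic representations, yielding one token-level vector
    per output position -- all positions computed at once."""
    d = encoder_out.shape[-1]
    scores = pos_queries @ encoder_out.T / np.sqrt(d)  # (L, T) attention scores
    attn = softmax(scores, axis=-1)                    # (L, T) attention weights
    return attn @ encoder_out                          # (L, d) token-level reps

rng = np.random.default_rng(0)
T, L, d, vocab = 50, 12, 16, 30      # frames, max token slots, model dim, vocab size (illustrative)
H = rng.standard_normal((T, d))      # stand-in for encoder acoustic representations
P = rng.standard_normal((L, d))      # stand-in for positional-encoding queries
W = rng.standard_normal((d, vocab))  # position-wise classification weights

token_reps = pds(H, P)               # (L, d): no token depends on another token
logits = token_reps @ W              # (L, vocab)
probs = softmax(logits, axis=-1)     # one distribution over the vocabulary per position
```

Because `probs` for every position is produced in one pass, decoding is a single argmax per slot rather than a left-to-right loop — this is the sense in which recognition becomes position-wise classification.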