End-to-end speech recognition is a promising technology for building compact automatic speech recognition (ASR) systems, since it unifies the acoustic and language models into a single neural network. As a drawback, however, training an end-to-end speech recognizer always requires transcribed utterances. Since end-to-end models are also known to be severely data-hungry, this constraint is critical, especially because obtaining transcribed utterances is costly and can be impractical or even impossible. This paper proposes a method for alleviating this issue by transferring knowledge from a neural network language model that can be pretrained on text-only data. Specifically, this paper attempts to transfer the semantic knowledge captured in the embedding vectors of large-scale language models. Since embedding vectors can be regarded as implicit representations of linguistic information such as part-of-speech and intent, they are expected to provide useful modeling cues for ASR decoders. This paper extends two types of ASR decoders, attention-based decoders and neural transducers, by modifying their training loss functions to include embedding prediction terms. The proposed systems were shown to be effective at reducing error rates without incurring extra computational cost in the decoding phase.
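To make the loss modification concrete, the following is a minimal sketch of a per-token training loss that augments the usual cross-entropy with an embedding prediction term. The specific penalty (cosine distance), the weight `lam`, and all function names are illustrative assumptions, not the paper's exact formulation; the paper only states that the decoder losses are extended with embedding prediction terms.

```python
import math

def softmax_cross_entropy(logits, target_idx):
    # Numerically stable cross-entropy for one token position.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_idx]

def cosine_distance(u, v):
    # 1 - cosine similarity; small when the vectors point the same way.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def combined_loss(logits, target_idx, predicted_emb, target_emb, lam=0.5):
    """Token-level ASR loss plus an embedding prediction penalty.

    predicted_emb: embedding predicted from the decoder state
                   (via a hypothetical projection head).
    target_emb:    pretrained LM embedding of the reference token.
    lam:           hypothetical weight balancing the two terms.
    """
    ce = softmax_cross_entropy(logits, target_idx)
    emb_penalty = cosine_distance(predicted_emb, target_emb)
    return ce + lam * emb_penalty
```

Because the embedding prediction head and the penalty term are used only during training, decoding proceeds exactly as in the baseline model, which is why no extra inference-time cost is incurred.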