Traditional automatic speech recognition~(ASR) systems usually focus on individual utterances, without considering long-form speech that carries useful historical information and is more common in real scenarios. Simply attending to a longer transcription history in a vanilla neural transducer model shows little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model, the vocabulary predictor. We propose the {LongFNT-Text} architecture, which fuses sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, using a pre-trained contextual encoder, RoBERTa, to further boost performance. Moreover, we propose the {LongFNT} architecture, which extends the use of long-form information to the original speech input and achieves the best performance. The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate~(WER) reduction, respectively.
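As an illustration of the sentence-level fusion described above, the following is a minimal sketch (not the authors' implementation): a RoBERTa sentence embedding of the transcription history is projected into the vocabulary-predictor space and combined with the predictor output through a gated residual connection before the joint network. The module name, dimensions, and the gating mechanism are assumptions made for this sketch.

\begin{verbatim}
import torch
import torch.nn as nn

class SentenceLevelFusion(nn.Module):
    """Hypothetical fusion of a long-form history embedding with the
    vocabulary-predictor output of a factorized neural transducer."""
    def __init__(self, predictor_dim: int = 512, context_dim: int = 768):
        super().__init__()
        # project the RoBERTa sentence embedding into the predictor space
        self.context_proj = nn.Linear(context_dim, predictor_dim)
        # gate controlling how much history is mixed in at each label step
        self.gate = nn.Linear(2 * predictor_dim, predictor_dim)

    def forward(self, predictor_out, history_emb):
        # predictor_out: (batch, label_len, predictor_dim)
        # history_emb:   (batch, context_dim) sentence embedding of history
        ctx = self.context_proj(history_emb).unsqueeze(1)
        ctx = ctx.expand(-1, predictor_out.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([predictor_out, ctx], dim=-1)))
        return predictor_out + g * ctx   # gated residual fusion

# usage with random tensors standing in for real model outputs
fusion = SentenceLevelFusion()
pred = torch.randn(2, 10, 512)   # vocabulary-predictor output
hist = torch.randn(2, 768)       # RoBERTa-style history embedding
fused = fusion(pred, hist)       # (2, 10, 512), fed to the joint network
\end{verbatim}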