Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and it is beneficial to integrate language identification into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identification (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and can achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging this difference in right-context, together with a streaming implementation of statistics pooling, the proposed method achieves accurate streaming LID prediction with little extra test-time cost. Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average LID prediction accuracy of 96.2% and the same second-pass WER as that obtained by including the oracle LID in the input.
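The streaming statistics pooling mentioned above can be realized with running sums, so that pooled statistics over all frames seen so far are available at every time step without re-reading the history. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function name and the use of NumPy arrays for encoder frames are assumptions for the example.

```python
import numpy as np

def streaming_stats_pooling(frames):
    """Yield a pooled [mean; std] vector after each incoming encoder frame.

    Maintains running first- and second-order sums, so each update costs
    O(d) per frame instead of re-pooling the entire history.
    This is an illustrative sketch, not the paper's exact implementation.
    """
    d = frames.shape[1]
    s1 = np.zeros(d)   # running sum of frames
    s2 = np.zeros(d)   # running sum of element-wise squared frames
    for t, x in enumerate(frames, start=1):
        s1 += x
        s2 += x * x
        mean = s1 / t
        # Clamp tiny negative values caused by floating-point rounding.
        var = np.maximum(s2 / t - mean * mean, 0.0)
        yield np.concatenate([mean, np.sqrt(var)])

# Example: 5 frames of 4-dimensional encoder outputs produce
# one 8-dimensional pooled vector per frame.
pooled = list(streaming_stats_pooling(np.random.randn(5, 4)))
```

Feeding each pooled vector to a per-frame LID classifier head is what allows the prediction to be refined as more audio context arrives, at constant cost per frame.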