Acoustic and linguistic features are both important cues for spoken language identification (LID). Recent advanced LID systems rely mainly on acoustic features and lack explicit linguistic feature encoding. In this paper, we propose a novel transducer-based language embedding approach for LID by integrating an RNN transducer model into a language embedding framework. Benefiting from the linguistic representation capability of the RNN transducer, the proposed method can exploit both phonetically aware acoustic features and explicit linguistic features for LID. Experiments were carried out on the large-scale multilingual LibriSpeech and VoxLingua107 datasets. The results show that the proposed method significantly improves LID performance, with relative improvements of 12% to 59% on in-domain datasets and 16% to 24% on cross-domain datasets.
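To make the fusion idea concrete, the following is a minimal sketch (not the authors' exact architecture): frame-level features from a pretrained RNN transducer are fused and pooled into an utterance-level language embedding for LID. Here `acoustic_feats` stands in for the transducer encoder output (phonetically aware) and `linguistic_feats` for features taken from the transducer's prediction/joint network; the layer choices, dimensions, and the number of target languages are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransducerLanguageEmbedding(nn.Module):
    """Sketch: fuse two transducer-derived feature streams into a language embedding."""

    def __init__(self, acoustic_dim=512, linguistic_dim=512, emb_dim=256, n_langs=107):
        super().__init__()
        # Project the concatenated acoustic + linguistic frames to the embedding space.
        self.proj = nn.Linear(acoustic_dim + linguistic_dim, emb_dim)
        # Classifier over languages, applied to mean+std statistics pooling.
        self.classifier = nn.Linear(2 * emb_dim, n_langs)

    def forward(self, acoustic_feats, linguistic_feats):
        # acoustic_feats, linguistic_feats: (batch, frames, dim)
        fused = torch.relu(self.proj(torch.cat([acoustic_feats, linguistic_feats], dim=-1)))
        # Statistics pooling over frames yields a fixed-size utterance representation.
        pooled = torch.cat([fused.mean(dim=1), fused.std(dim=1)], dim=-1)
        return self.classifier(pooled), pooled  # LID logits and the language embedding

# Usage: 4 utterances, 200 frames, 512-dim features from each transducer branch.
model = TransducerLanguageEmbedding()
logits, embedding = model(torch.randn(4, 200, 512), torch.randn(4, 200, 512))
```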