We propose data- and knowledge-driven approaches for multilingual training of an automatic speech recognition (ASR) system for a target language by pooling speech data from multiple source languages. Exploiting the acoustic similarities between Indian languages, we implement two approaches. In phone/senone mapping, a deep neural network (DNN) learns to map the senones or phones of one language to those of the others, and the transcriptions of the source languages are modified so that they can be used together with the target language data to train and fine-tune the target language ASR system. In the other approach, we model the acoustic information of all the languages simultaneously by training a multitask DNN (MTDNN) that predicts the senones of each language in a separate output layer. The cross-entropy loss and the weight update procedure are modified so that, when a feature vector belongs to a particular language, only the shared layers and the output layer responsible for predicting that language's senone classes are updated. In the low-resource setting (LRS), 40 hours of transcribed data each for Tamil, Telugu and Gujarati are used for training. The DNN-based senone mapping technique gives relative improvements in word error rate (WER) of 9.66%, 7.2% and 15.21% over the baseline system for Tamil, Gujarati and Telugu, respectively. In the medium-resource setting (MRS), 160, 275 and 135 hours of data for Tamil, Kannada and Hindi are used, where the same technique gives larger relative improvements of 13.94%, 10.28% and 27.24% for Tamil, Kannada and Hindi, respectively. The MTDNN with senone mapping based training gives higher relative WER improvements of 15.0%, 17.54% and 16.06% for Tamil, Gujarati and Telugu in the LRS, whereas in the MRS we see improvements of 21.24%, 21.05% and 30.17% for Tamil, Kannada and Hindi, respectively.
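As a rough illustration of the shared-layer and language-specific-head structure described above (a minimal sketch, not the authors' implementation; the framework, layer sizes, senone counts and language indices are assumptions), a multitask DNN can route each mini-batch to the output layer of its language, so that the cross-entropy loss back-propagates only through the shared layers and that language's head:

```python
# Sketch of an MTDNN with shared hidden layers and one senone head per
# language. All dimensions below are illustrative, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultitaskDNN(nn.Module):
    def __init__(self, feat_dim, hidden_dim, senone_counts):
        super().__init__()
        # Shared hidden layers, updated by data from every language.
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One output layer per language, predicting that language's senones.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, n) for n in senone_counts]
        )

    def forward(self, feats, lang_id):
        return self.heads[lang_id](self.shared(feats))

# Example training step: the language of the feature vectors selects the
# head, so only the shared layers and that head receive gradient updates.
model = MultitaskDNN(feat_dim=40, hidden_dim=512,
                     senone_counts=[3000, 3200, 2800])
optim = torch.optim.SGD(model.parameters(), lr=0.01)

feats = torch.randn(32, 40)                 # a mini-batch of acoustic features
senone_targets = torch.randint(0, 3000, (32,))
lang_id = 0                                 # all frames here belong to language 0

logits = model(feats, lang_id)
loss = F.cross_entropy(logits, senone_targets)
optim.zero_grad()
loss.backward()                             # other languages' heads get no gradient
optim.step()
```

Because only the selected head participates in the forward pass, the heads of the other languages accumulate no gradient and are left untouched by the optimizer step, which mirrors the modified weight update procedure described in the abstract.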