Low-resource automatic speech recognition (ASR) is a useful but thorny task, since deep learning ASR models usually require large amounts of training data. Most existing models establish a bottleneck (BN) layer by pre-training on a high-resource source language and then transfer it to the low-resource target language. In this work, we introduce an adaptive activation network into the upper layers of the ASR model, applying different activation functions to different languages. We also propose two approaches to train the model: (1) cross-lingual learning, which replaces the source-language activation functions with those of the target language, and (2) multilingual learning, which jointly trains the Connectionist Temporal Classification (CTC) loss of each language together with the relevance between languages. Our experiments on the IARPA Babel datasets demonstrate that our approaches outperform both from-scratch training and traditional bottleneck-feature-based methods. In addition, combining cross-lingual and multilingual learning further improves the performance of multilingual speech recognition.
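To make the idea concrete, below is a minimal PyTorch sketch of how a language-adaptive activation layer and a joint multilingual CTC objective might look. The abstract gives no implementation details, so the parameterization (a learnable softmax mixture over basis activations), all names (`AdaptiveActivation`, `UpperLayer`, `lang_id`, `joint_ctc_loss`), and the omission of the cross-language relevance term are assumptions of this sketch, not the authors' code.

```python
# Illustrative sketch only; the parameterization and names are assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn


class AdaptiveActivation(nn.Module):
    """Per-language activation: each language learns its own mixture
    over a fixed set of basis activation functions (assumed form)."""

    def __init__(self, num_langs: int, num_basis: int = 3):
        super().__init__()
        # One mixing-weight vector per language (hypothetical parameterization).
        self.weights = nn.Parameter(torch.ones(num_langs, num_basis) / num_basis)
        self.basis = [torch.relu, torch.tanh, torch.sigmoid]

    def forward(self, x: torch.Tensor, lang_id: int) -> torch.Tensor:
        w = torch.softmax(self.weights[lang_id], dim=0)
        return sum(wi * f(x) for wi, f in zip(w, self.basis))


class UpperLayer(nn.Module):
    """One 'upper' encoder layer whose activation adapts to the language."""

    def __init__(self, dim: int, num_langs: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.act = AdaptiveActivation(num_langs)

    def forward(self, x: torch.Tensor, lang_id: int) -> torch.Tensor:
        return self.act(self.linear(x), lang_id)


# Multilingual learning (sketch): sum per-language CTC losses. The abstract's
# "relevance" term between languages is omitted, as its form isn't given here.
ctc = nn.CTCLoss(blank=0)


def joint_ctc_loss(per_lang_batches: dict) -> torch.Tensor:
    # per_lang_batches: {lang_id: (log_probs (T, N, C), targets,
    #                              input_lengths, target_lengths)}
    # log_probs are assumed to be log-softmax outputs, as nn.CTCLoss expects.
    return sum(ctc(lp, tg, il, tl)
               for lp, tg, il, tl in per_lang_batches.values())


# Cross-lingual learning then amounts to swapping which language's
# activation parameters are active for the same shared layer:
layer = UpperLayer(dim=256, num_langs=4)
x = torch.randn(10, 256)       # toy (time, feature) input
y_src = layer(x, lang_id=0)    # source-language activation
y_tgt = layer(x, lang_id=3)    # target-language activation after the swap
```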