Transformers have recently become popular for sequence-to-sequence applications such as machine translation and speech recognition. In this work, we propose a multi-task learning-based transformer model for low-resource multilingual speech recognition for Indian languages. Our proposed model consists of a conformer [1] encoder and two parallel transformer decoders. We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict the grapheme sequence. We treat phoneme recognition as an auxiliary task in our multi-task learning framework. We jointly optimize the network for both the phoneme and grapheme recognition tasks using joint CTC-attention [2] training. We use a conditional decoding scheme to inject language information into the model before predicting the grapheme sequence. Our experiments show that our proposed approach obtains significant improvements over previous approaches [4]. We also show that our conformer-based dual-decoder approach outperforms both the transformer-based dual-decoder approach and the single-decoder approach. Finally, we compare monolingual ASR models with our proposed multilingual ASR approach.
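As a rough illustration of how the two joint CTC-attention objectives described above might be combined, the minimal PyTorch sketch below weights an auxiliary phoneme loss against the primary grapheme loss. The function name, the interpolation weights `lam` and `alpha`, the padding convention, and the tensor layout are assumptions for illustration, not the paper's reported configuration.

```python
import torch.nn.functional as F

def dual_decoder_loss(phn_ctc_logp, grp_ctc_logp,      # (T, B, V) encoder CTC log-probs
                      phn_att_logits, grp_att_logits,  # (B, L, V) decoder output logits
                      phn_tgt, grp_tgt,                # (B, L) padded target ids, pad = -1
                      in_lens, phn_lens, grp_lens,     # 1-D length tensors for CTC
                      lam=0.3, alpha=0.3):
    """Hypothetical combination of two joint CTC-attention objectives.

    lam interpolates CTC vs. attention loss within each task; alpha
    weights the auxiliary phoneme task against the primary grapheme
    task. Both values are illustrative assumptions.
    """
    # Joint CTC-attention loss for the auxiliary phoneme task.
    # Padded target positions beyond phn_lens are never read by CTC,
    # so clamping pad ids to 0 (the assumed blank id) is safe here.
    phn_ctc = F.ctc_loss(phn_ctc_logp, phn_tgt.clamp(min=0),
                         in_lens, phn_lens, blank=0, zero_infinity=True)
    phn_att = F.cross_entropy(phn_att_logits.transpose(1, 2), phn_tgt,
                              ignore_index=-1)
    phn_loss = lam * phn_ctc + (1 - lam) * phn_att

    # Joint CTC-attention loss for the primary grapheme task.
    grp_ctc = F.ctc_loss(grp_ctc_logp, grp_tgt.clamp(min=0),
                         in_lens, grp_lens, blank=0, zero_infinity=True)
    grp_att = F.cross_entropy(grp_att_logits.transpose(1, 2), grp_tgt,
                              ignore_index=-1)
    grp_loss = lam * grp_ctc + (1 - lam) * grp_att

    # Weight the auxiliary task against the primary task.
    return alpha * phn_loss + (1 - alpha) * grp_loss
```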