Transformer-based models have recently become very popular for sequence-to-sequence applications such as machine translation and speech recognition. This work proposes a dual-decoder transformer model for low-resource multilingual speech recognition of Indian languages. Our proposed model consists of a Conformer [1] encoder, two parallel transformer decoders, and a language classifier. We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict the grapheme sequence along with language information. We treat phoneme recognition and language identification as auxiliary tasks in a multi-task learning framework, and we jointly optimize the network for the phoneme recognition, grapheme recognition, and language identification tasks using joint CTC-attention training [2]. Our experiments show that our approach obtains a significant reduction in WER over the baseline approaches, and that the dual-decoder model yields a significant improvement over a single-decoder counterpart.
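To make the architecture concrete, here is a minimal PyTorch sketch of the dual-decoder setup, not the authors' implementation: a plain Transformer encoder stands in for the Conformer encoder [1], and every dimension, vocabulary size, and loss weight (lam, alpha, beta) is an illustrative assumption.

```python
# Minimal sketch of the dual-decoder model; NOT the authors' code.
# A vanilla Transformer encoder stands in for the Conformer encoder [1],
# and all sizes and loss weights below are illustrative assumptions.
import torch
import torch.nn as nn

class DualDecoderASR(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_phn=64, n_grp=128, n_lang=4):
        super().__init__()
        # Shared acoustic encoder (a Conformer in the paper).
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Two parallel attention decoders: PHN-DEC and GRP-DEC.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.phn_dec = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.grp_dec = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.phn_emb = nn.Embedding(n_phn, d_model)
        self.grp_emb = nn.Embedding(n_grp, d_model)
        self.phn_out = nn.Linear(d_model, n_phn)
        self.grp_out = nn.Linear(d_model, n_grp)
        self.ctc_out = nn.Linear(d_model, n_grp)    # CTC branch over encoder states
        self.lang_out = nn.Linear(d_model, n_lang)  # utterance-level language classifier

    def forward(self, feats, phn_in, grp_in):
        h = self.encoder(feats)  # (batch, frames, d_model)
        mask_p = nn.Transformer.generate_square_subsequent_mask(phn_in.size(1))
        mask_g = nn.Transformer.generate_square_subsequent_mask(grp_in.size(1))
        phn = self.phn_out(self.phn_dec(self.phn_emb(phn_in), h, tgt_mask=mask_p))
        grp = self.grp_out(self.grp_dec(self.grp_emb(grp_in), h, tgt_mask=mask_g))
        ctc = self.ctc_out(h).log_softmax(-1)        # CTC needs log-probs
        lang = self.lang_out(h.mean(dim=1))          # mean-pool over time
        return phn, grp, ctc, lang

# Joint objective: CTC-attention interpolation for the grapheme task [2] plus
# auxiliary phoneme-recognition and language-ID cross-entropy terms.
# (Teacher forcing with shifted decoder inputs is omitted for brevity.)
model = DualDecoderASR()
feats = torch.randn(2, 100, 256)                         # (batch, frames, features)
phn_tok = torch.randint(1, 64, (2, 12))                  # dummy phoneme targets
grp_tok = torch.randint(1, 128, (2, 20))                 # dummy grapheme targets
lang_tgt = torch.randint(0, 4, (2,))
phn, grp, ctc, lang = model(feats, phn_tok, grp_tok)

ce = nn.CrossEntropyLoss()
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)(
    ctc.transpose(0, 1),                                 # CTCLoss expects (T, B, C)
    grp_tok,
    torch.full((2,), 100, dtype=torch.long),             # encoder frame lengths
    torch.full((2,), 20, dtype=torch.long))              # grapheme target lengths
lam, alpha, beta = 0.3, 0.3, 0.1                         # illustrative weights
loss = (lam * ctc_loss
        + (1 - lam) * ce(grp.transpose(1, 2), grp_tok)   # attention branch
        + alpha * ce(phn.transpose(1, 2), phn_tok)       # auxiliary phoneme task
        + beta * ce(lang, lang_tgt))                     # auxiliary language ID
loss.backward()
```

The single interpolation weight between the CTC and attention terms follows the usual joint CTC-attention formulation [2]; the auxiliary phoneme and language terms are what the multi-task setup adds on top of it.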