In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pre-training and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions, respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, where it yields a relative word error rate reduction of 6% against the previous approach.
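To make the multi-task objective concrete, the following is a minimal PyTorch sketch that combines a supervised phonetic CTC loss with a wav2vec 2.0-style contrastive loss over masked frames. The function name `unispeech_style_loss`, the tensor layout, the temperature, and the mixing weight `alpha` are all illustrative assumptions for exposition, not the paper's actual implementation, which also involves details such as mixing quantized and contextual representations.

```python
import torch
import torch.nn.functional as F

def unispeech_style_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                         context_vecs, quantized_targets, distractors, alpha=0.5):
    """Sketch of a UniSpeech-style multi-task objective (hypothetical API).

    ctc_log_probs:     (T, N, C) log-probabilities from the phonetic CTC head
    targets:           (N, S) phonetic target sequences for the labeled batch
    context_vecs:      (M, D) contextual representations at masked positions
    quantized_targets: (M, D) positive quantized latents for those positions
    distractors:       (M, K, D) negatives sampled from other masked positions
    alpha:             assumed mixing weight between the two losses
    """
    # Supervised branch: CTC over phonetic targets.
    ctc_loss = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                          blank=0, zero_infinity=True)

    # Self-supervised branch: contrastive loss at each masked frame.
    # The positive candidate is placed at index 0 of the candidate set.
    candidates = torch.cat([quantized_targets.unsqueeze(1), distractors], dim=1)  # (M, K+1, D)
    logits = F.cosine_similarity(context_vecs.unsqueeze(1), candidates, dim=-1) / 0.1
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # positive is index 0
    contrastive_loss = F.cross_entropy(logits, labels)

    # Multi-task combination on labeled data.
    return ctc_loss + alpha * contrastive_loss
```

On unlabeled data, only the contrastive term would apply; the sketch above corresponds to the labeled-data case where both objectives are optimized jointly.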