We present CLSRIL-23, a self-supervised learning based audio pre-trained model which learns cross-lingual speech representations from raw audio across 23 Indic languages. It is built on top of wav2vec 2.0, which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across all languages. We compare the language-wise loss during pretraining to study the effects of monolingual and multilingual pretraining. Performance on downstream fine-tuning tasks for speech recognition is also compared, and our experiments show that multilingual pretraining outperforms monolingual pretraining, both in terms of learning speech representations which encode the phonetic similarity of languages and in terms of performance on downstream tasks. A decrease of 5% in WER and 9.5% in CER is observed when a multilingual pretrained model is used for fine-tuning in Hindi. All the code and models are also open-sourced. CLSRIL-23 is a model trained on 23 languages and almost 10,000 hours of audio data to facilitate research in speech recognition for Indic languages. We hope that new state-of-the-art systems will be built using this self-supervised approach, especially for low-resource Indic languages.
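For reference, a minimal sketch of the contrastive objective, assuming the pretraining follows the standard wav2vec 2.0 formulation (the notation below is from the wav2vec 2.0 paper, not symbols introduced in this work): with $\mathbf{c}_t$ the context-network output at a masked time step $t$, $\mathbf{q}_t$ the true quantized latent, $\mathbf{Q}_t$ a set of distractors including $\mathbf{q}_t$, $\mathrm{sim}$ cosine similarity, and $\kappa$ a temperature,
\[
\mathcal{L}_m = -\log \frac{\exp\!\left(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\right)}{\sum_{\tilde{\mathbf{q}} \in \mathbf{Q}_t} \exp\!\left(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\right)},
\]
which wav2vec 2.0 combines with a codebook diversity loss $\mathcal{L}_d$ as $\mathcal{L} = \mathcal{L}_m + \alpha \mathcal{L}_d$; the shared quantizer means the same codebook entries are contrasted across all 23 languages.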