While Turkish is listed among low-resource languages, the literature on Turkish automatic speech recognition (ASR) is relatively old. In this paper, we present HuBERT-TR, a speech representation model for Turkish based on HuBERT. HuBERT-TR achieves state-of-the-art results on several Turkish ASR datasets. We investigate pre-training HuBERT for Turkish with large-scale data curated from online resources. We pre-train HuBERT-TR using over 6,500 hours of speech data curated from YouTube, which exhibits extensive variability in quality and genre. We show that models pre-trained in a multilingual setup are inferior to language-specific models: our Turkish HuBERT-TR base model outperforms its roughly 10x larger multilingual counterpart, XLS-R-1B. Moreover, we study the effect of scaling on ASR performance by scaling our models up to 1B parameters. Our best model yields a state-of-the-art word error rate of 4.97% on the Turkish Broadcast News dataset. Models are available at huggingface.co/asafaya.