Although Turkish is considered a low-resource language, the literature on Turkish automatic speech recognition (ASR) is relatively dated. In this paper, we present HuBERT-TR, a speech representation model for Turkish based on HuBERT. HuBERT-TR achieves state-of-the-art results on several Turkish ASR datasets. We investigate pre-training HuBERT for Turkish with large-scale data curated from online resources: we pre-train HuBERT-TR on over 6,500 hours of speech curated from YouTube, spanning a wide range of quality and genre. We show that language-specific models are superior to other pre-trained models, with our Turkish HuBERT-TR/base outperforming the 10x larger state-of-the-art multilingual XLS-R-1b model in low-resource settings. Moreover, we study the effect of scale on ASR performance by scaling our models up to 1B parameters. Our best model yields a state-of-the-art word error rate of 4.97% on the Turkish Broadcast News dataset. Models are available at https://huggingface.co/asafaya