This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 93 hours of transcribed audio recorded by two professional speakers (one female, one male). It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech (TTS) applications in both academia and industry. We share our experience by describing the dataset development procedures and the challenges we faced, and we discuss important future directions. To demonstrate the reliability of the dataset, we built baseline end-to-end TTS models and evaluated them using the subjective mean opinion score (MOS) measure. The evaluation results show that the best TTS models trained on our dataset achieve a MOS above 4 for both speakers, which makes them suitable for practical use. The dataset, training recipe, and pretrained TTS models are freely available.
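The MOS figures mentioned in the abstract are averages of listener ratings on a 1-to-5 scale. The following is a minimal sketch of how such a score could be computed per speaker, together with a normal-approximation 95% confidence interval. The function name `mean_opinion_score`, the interval method, and all ratings shown are assumptions made for illustration; they are not taken from the paper's evaluation.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings.

    Hypothetical helper: the paper's exact evaluation protocol is not
    described in the abstract, so this is only one common way to do it.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up ratings for illustration only, NOT results from the paper.
ratings_by_speaker = {
    "female": [5, 4, 4, 5, 4, 5, 4, 4],
    "male": [4, 4, 5, 4, 3, 5, 4, 5],
}

for speaker, ratings in ratings_by_speaker.items():
    mos, ci = mean_opinion_score(ratings)
    print(f"{speaker}: MOS = {mos:.2f} +/- {ci:.2f}")
```

In practice each synthesized utterance is rated by several listeners, and a claim such as "MOS above 4" is more convincing when the lower bound of the interval also clears that threshold.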