KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset

This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 93 hours of transcribed audio recordings spoken by two professional speakers (one female and one male). It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech (TTS) applications in both academia and industry. In this paper, we share our experience by describing the dataset development procedures and the challenges we faced, and we discuss important future directions. To demonstrate the reliability of our dataset, we built baseline end-to-end TTS models and evaluated them using the subjective mean opinion score (MOS) measure. Evaluation results show that the best TTS models trained on our dataset achieve a MOS above 4 for both speakers, which makes them suitable for practical use. The dataset, training recipe, and pretrained TTS models are freely available.
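For readers unfamiliar with the MOS measure mentioned above, the following is a minimal sketch of how a mean opinion score is commonly computed from listener ratings on a 1 to 5 scale. It is not the authors' evaluation code: the function name `mean_opinion_score`, the normal-approximation 95% interval, and the example ratings are all assumptions made for illustration.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, half_width) for a list of listener ratings on a 1-5 scale.

    half_width is the half-width of a normal-approximation 95% confidence
    interval (an assumption of this sketch, not taken from the paper).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n), scaled by 1.96.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for one synthesized voice; NOT data from the paper.
example_ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos, half_width = mean_opinion_score(example_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

In a real listening test, each rating would come from a listener scoring one synthesized utterance, and the scores would typically be aggregated separately for each speaker and each model.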


Related content

Speech synthesis


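Speech synthesis, also known as text-to-speech (TTS), converts arbitrary input text into natural, fluent speech output. It draws on artificial intelligence, psychology, acoustics, linguistics, digital signal processing, computer science, and other disciplines, and it is a frontier technology in the field of information processing. As computing technology has advanced, speech synthesis has evolved from early formant synthesis to waveform concatenation synthesis and statistical parametric speech synthesis, and then to hybrid speech synthesis. The quality and naturalness of synthesized speech have improved markedly and can largely meet the needs of certain specific applications. Today, speech synthesis is widely used in information announcement systems in banks and hospitals, in car navigation systems, and in automated-response call centers, where it has yielded substantial economic benefits. In addition, with the proliferation of smartphones, MP3 players, PDAs, and other devices closely tied to daily life, applications of speech synthesis are gradually extending into entertainment, language teaching, rehabilitation therapy, and other fields. Speech synthesis can be said to be affecting every aspect of people's lives.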