Modern text-to-speech (TTS) systems use deep learning to synthesize speech that increasingly approaches human quality, but they require a database of high-quality audio-text sentence pairs for training. Malayalam, the official language of the Indian state of Kerala, is spoken by more than 35 million people, yet it remains a low-resource language in terms of corpora available for TTS systems. In this paper, we present IMaSC, a Malayalam text and speech corpus containing approximately 50 hours of recorded speech. With 8 speakers and a total of 34,473 text-audio pairs, IMaSC is larger than every other publicly available alternative. We evaluated the corpus by using it to train one TTS model per speaker, each based on a modern deep learning architecture. Through subjective evaluation, we show that our models perform significantly better in terms of naturalness than previous studies and publicly available models, with an average mean opinion score of 4.50, indicating that the synthesized speech is close to human quality.