End-to-end text-to-speech (TTS) systems for European languages such as English and Spanish achieve state-of-the-art speech quality, prosody, and naturalness. End-to-end TTS for Indian languages, however, lags behind in quality. The main challenges are: 1) scarcity of high-quality training data; 2) low efficiency during training and inference; and 3) slow convergence when the vocabulary is large. In this paper, we investigate fine-tuning an English-pretrained Tacotron2 model on limited Sanskrit data to synthesize natural-sounding Sanskrit speech in a low-resource setting. Our experiments show encouraging results: the system achieves an overall mean opinion score (MOS) of 3.38, as rated by 37 evaluators with good spoken knowledge of Sanskrit. We consider this a strong result given that only 2.5 hours of speech data were used.
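For readers unfamiliar with the metric, a MOS is the arithmetic mean of listener ratings on a 1 to 5 scale. The following is a minimal sketch of how such a score and an approximate 95% confidence interval could be computed. The function name `mean_opinion_score` and the ratings are hypothetical placeholders for illustration; they are not the paper's evaluation code or data, and the normal-approximation interval is an assumption rather than a procedure stated in the paper.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for a list of 1-5 listener ratings.

    The interval uses a normal approximation (1.96 * standard error),
    which is an assumption made for this sketch.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie in the range 1..5")
    mos = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, 1.96 * stderr


# Hypothetical ratings, NOT taken from the paper's listening test.
hypothetical_ratings = [4, 3, 3, 4, 2, 5, 3, 4]
mos, ci = mean_opinion_score(hypothetical_ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

In a real listening test, each evaluator typically rates several utterances, so the ratings would be pooled (or averaged per evaluator first) before this computation.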