At present, text-to-speech (TTS) systems built on end-to-end neural models and trained on high-quality transcribed speech data can generate speech that is intelligible, natural, and closely resembles human speech. These models are trained on relatively large amounts of professionally recorded single-speaker audio, typically extracted from audiobooks. Because freely available speech corpora of this kind are scarce for Arabic, a large gap exists in Arabic TTS research and development. Most of the existing freely available Arabic speech corpora are not suitable for TTS training, as they contain multi-speaker casual speech with variations in recording conditions and quality, whereas the corpora curated for speech synthesis are generally small and not suitable for training state-of-the-art end-to-end models. In a move towards filling this gap in resources, we present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic. The speech is extracted from a LibriVox audiobook and is then processed, segmented, and manually transcribed and annotated. The final ClArTTS corpus contains about 12 hours of speech from a single male speaker, sampled at 40.1 kHz (40,100 Hz). In this paper, we describe the process of corpus creation and provide corpus statistics and a comparison with existing resources. Furthermore, we develop two TTS systems based on Grad-TTS and Glow-TTS and illustrate the performance of the resulting systems via subjective and objective evaluations. The corpus will be made publicly available at www.clartts.com for research purposes, along with a demo of the baseline TTS systems.