The increasing availability of audio data on the internet has led to a multitude of datasets for developing and training neural-network-based text-to-speech (TTS) applications. However, highly varying voice quality, low sampling rates, a lack of text normalization, and poor alignment of audio samples to the corresponding transcript sentences still limit the performance of deep neural networks trained on this task. Additionally, data resources in languages such as German remain very limited. We introduce the "HUI-Audio-Corpus-German", a large, open-source dataset for TTS engines, created with a processing pipeline that produces high-quality audio-to-transcription alignments and reduces the manual effort needed for dataset creation.