Speech provides a natural way for human-computer interaction. In particular, speech synthesis systems are popular in different applications, such as personal assistants, GPS applications, screen readers and accessibility tools. However, not all languages are on the same level in terms of resources and systems for speech synthesis. This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker, and among the models trained on it, a Tacotron 2 model with the RTISI-LA vocoder performed best, achieving a mean opinion score (MOS) of 4.03. The obtained results are comparable to related works covering the English language and to the state of the art in Portuguese.