This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.
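The unified Python inference interface mentioned above can be exercised in a few lines. The following is a minimal sketch, assuming the `espnet_model_zoo` package is installed and that a model tag such as "kan-bayashi/ljspeech_vits" is available in the ESPnet model zoo; any other published tag can be substituted.

```python
# Minimal sketch: synthesize speech with a pre-trained ESPnet2-TTS model.
# Assumes `espnet_model_zoo` is installed and the model tag below is
# downloadable; substitute any tag listed in the ESPnet model zoo.
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Download and instantiate a pre-trained model from its tag.
text2speech = Text2Speech.from_pretrained("kan-bayashi/ljspeech_vits")

# Run synthesis; the returned dict holds the waveform tensor under "wav".
output = text2speech("Hello, this is a test of ESPnet2-TTS.")

# Save the generated waveform at the model's sampling rate.
sf.write("sample.wav", output["wav"].numpy(), text2speech.fs)
```

This reflects the single-call, model-tag-driven design of the interface: users obtain baseline samples or build demos without preparing a recipe or training pipeline.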