Neural network based end-to-end Text-to-Speech (TTS) has greatly improved the quality of synthesized speech. However, how to efficiently use massive amounts of spontaneous speech without transcription remains an open problem. In this paper, we propose MHTTS, a fast multi-speaker TTS system that is robust to transcription errors and to speaking-style speech data. Specifically, we introduce a multi-head model and, by training on both jointly, transfer text information from a high-quality corpus with manual transcription to spontaneous speech with imperfectly recognized transcription. MHTTS has three advantages: 1) It synthesizes better-quality multi-speaker voices with faster inference speed. 2) It can transfer correct text information to data with imperfect transcription, whether simulated by corruption or produced by an Automatic Speech Recognizer (ASR). 3) It can utilize massive real spontaneous speech with imperfect transcription and synthesize expressive voices.
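To make the idea of imperfect transcription "simulated using corruption" concrete, the following is a minimal sketch of one possible corruption scheme. It is an assumption for illustration only: the abstract does not specify how corruption is performed, and the function name `corrupt_transcript`, the three edit operations, and the `error_rate` and `vocab` parameters are all hypothetical.

```python
import random


def corrupt_transcript(tokens, error_rate, vocab, seed=None):
    """Return a noisy copy of `tokens` (hypothetical scheme, not the paper's).

    Each token is independently corrupted with probability `error_rate`.
    A corrupted token is substituted, deleted, or followed by an inserted
    token, loosely mimicking the error types an ASR system produces.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible per seed
    noisy = []
    for tok in tokens:
        if rng.random() >= error_rate:
            noisy.append(tok)  # keep the token unchanged
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            noisy.append(rng.choice(vocab))  # replace with a random vocab token
        elif op == "insert":
            noisy.append(tok)
            noisy.append(rng.choice(vocab))  # add a spurious token after it
        # "delete": append nothing, so the token is dropped
    return noisy


clean = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["a", "dog", "ran", "under", "hat"]
noisy = corrupt_transcript(clean, error_rate=0.3, vocab=vocab, seed=0)
```

Under this assumed scheme, the clean and corrupted versions of a transcript could be paired during training to check how well a model tolerates a chosen error rate.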