In multi-speaker speech synthesis, the training data tend to be highly diverse because speakers differ widely in age, speaking style, speaking rate, emotion, and so on. This diversity leads to the one-to-many mapping problem \cite{Ren2020FastSpeech2F, Kumar2020FewSA}, which makes improving the modeling capability of multi-speaker speech synthesis important but challenging. To address the issue, this paper investigates the effective use of control information, such as speaker identity and pitch, which is kept separate from the text-content information in our encoder-decoder framework: 1) We design a representation of the harmonic structure of speech, called the excitation spectrogram, derived from pitch and energy. The excitation spectrogram is fed to the decoder together with the text content to guide the learning of the harmonics of the mel-spectrogram. 2) We propose the conditional gated LSTM (CGLSTM), whose input, output, and forget gates are re-weighted by the speaker embedding to control the flow of text-content information in the network. Experiments show a significant reduction in the mel-spectrogram reconstruction error when training the multi-speaker generative model, and a large improvement in the subjective evaluation of the speaker-adapted model; e.g., the Mean Opinion Score (MOS) of intelligibility increases by 0.81 points.
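As a minimal sketch of how the two ideas above could be written down, and not the formulation used in the paper, assume a voiced frame $t$ with pitch $f_{0,t}$ and energy $e_t$, a hypothetical harmonic bandwidth $w$, and $K$ harmonics. One possible excitation spectrogram then places an energy-scaled bump at each multiple of the pitch:
\[
E(t,f) = e_t \sum_{k=1}^{K} \exp\!\left(-\frac{(f - k f_{0,t})^2}{2 w^2}\right).
\]
Likewise, assuming a standard LSTM with input $x_t$, hidden state $h_{t-1}$, and cell state $c_{t-1}$, a speaker embedding $s$, and hypothetical projection matrices $V_i$, $V_f$, $V_o$, a speaker-conditioned re-weighting of the gates might take the form
\[
\begin{array}{rcl}
i_t &=& \sigma(W_i [x_t, h_{t-1}] + b_i) \odot \sigma(V_i s), \\
f_t &=& \sigma(W_f [x_t, h_{t-1}] + b_f) \odot \sigma(V_f s), \\
o_t &=& \sigma(W_o [x_t, h_{t-1}] + b_o) \odot \sigma(V_o s), \\
c_t &=& f_t \odot c_{t-1} + i_t \odot \tanh(W_c [x_t, h_{t-1}] + b_c), \\
h_t &=& o_t \odot \tanh(c_t),
\end{array}
\]
where $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication. The Gaussian-shaped harmonics, the multiplicative form of the re-weighting, and all symbols introduced here are illustrative assumptions; the abstract states only that the excitation spectrogram is derived from pitch and energy and that the gates are re-weighted by the speaker embedding.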