This paper proposes a controllable singing voice synthesis system capable of generating an expressive singing voice via two novel methodologies. First, we propose a local style token module, which predicts frame-level style tokens from input pitch and text sequences, allowing the system to control musical expressions often unspecified in sheet music (e.g., breathing and intensity). Second, we propose a dual-path pitch encoder that accepts either of two pitch inputs: a MIDI pitch sequence or an f0 contour. Because the initial generation of a singing voice is usually performed from a MIDI pitch sequence, one can later extract the f0 contour from the generated singing voice and modify it at a finer level as desired. Through quantitative and qualitative evaluations, we confirm that the proposed model can control various musical expressions without sacrificing the sound quality of the synthesized singing voice.