We propose U-Singer, the first multi-singer emotional singing voice synthesizer that expresses various levels of emotional intensity. While synthesizing singing voices according to the lyrics, pitch, and duration of the music score, U-Singer reflects singer characteristics and emotional intensity by adding variance in pitch, energy, and phoneme duration according to the singer ID and emotional intensity level. By representing all attributes as conditional residual embeddings in a single unified embedding space, U-Singer controls mutually correlated style attributes while minimizing interference between them. Additionally, we apply emotion embedding interpolation and extrapolation techniques that lead the model to learn a linear embedding space and allow it to express emotional intensity levels not included in the training data. In experiments, U-Singer synthesized high-fidelity singing voices reflecting the singer ID and emotional intensity. Visualization of the unified embedding space shows that U-Singer estimates variations in pitch and energy that are highly correlated with the singer ID and emotional intensity level. Audio samples are available at https://u-singer.github.io.
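As a minimal sketch of the interpolation and extrapolation idea, the snippet below blends a neutral embedding with a full-intensity emotion embedding using a scalar intensity, where values above 1 extrapolate beyond the training range. All names here (`e_neutral`, `e_happy`, `emotion_embedding`, `e_singer`) are hypothetical illustrations rather than the paper's actual implementation, and the residual-sum composition at the end is only an assumed reading of the unified embedding space.

```python
import numpy as np

# Illustrative stand-ins for learned embeddings; the real model learns
# these jointly with the synthesizer and conditions on more attributes.
rng = np.random.default_rng(0)
e_neutral = rng.normal(size=256)  # hypothetical neutral-emotion embedding
e_happy = rng.normal(size=256)    # hypothetical full-intensity "happy" embedding

def emotion_embedding(alpha: float) -> np.ndarray:
    """Linearly blend embeddings: alpha in [0, 1] interpolates between
    neutral and full intensity; alpha > 1 extrapolates to intensity
    levels not seen during training."""
    return (1.0 - alpha) * e_neutral + alpha * e_happy

weak = emotion_embedding(0.3)    # mild emotional intensity
strong = emotion_embedding(1.5)  # stronger than any training example

# Assumed residual composition: attribute embeddings summed as residuals
# in one shared space, so singer identity and emotion can be controlled
# jointly while limiting interference.
e_singer = rng.normal(size=256)  # hypothetical singer-ID embedding
unified = e_singer + emotion_embedding(0.7)
```

Because the blend is linear in `alpha`, a single scalar smoothly scales the emotion component without retraining, which is what makes the extrapolated intensities in the sketch well defined.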