Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which are increasingly in demand in creative applications such as video games, movies, and virtual characters. We introduce Non-Human Singing Generation (NHSG), covering non-human singing voice synthesis (NHSVS) and non-human singing voice conversion (NHSVC), as a novel machine learning task for generating musically coherent singing with non-human timbral characteristics. NHSG is particularly challenging because of the scarcity of non-human singing data, the lack of symbolic alignment, and the wide timbral gap between human and non-human voices. To address these challenges, we propose CartoonSing, a unified framework that integrates singing voice synthesis and conversion while bridging human and non-human singing generation. CartoonSing employs a two-stage pipeline: a score representation encoder trained on annotated human singing, and a timbre-aware vocoder that reconstructs waveforms for both human and non-human audio. Experiments demonstrate that CartoonSing successfully generates non-human singing voices, generalizes to novel timbres, and extends conventional SVS and SVC toward creative, non-human singing generation.